Introduction
This is a Multiple Linear Regression Analysis with the R programming language that attempts to create a linear regression model to predict IMDb Ratings for movies using data from Rotten Tomatoes and The Academy for Motion Picture Arts & Sciences.
Load data
We are using a dataset from 2016 with 651 rows. Its codebook can be viewed here.
Research Question
Can we predict IMDb Movie Ratings using Rotten Tomatoes Ratings, movie genre, runtime, MPAA rating, theater debut year, Oscar awards/nominations, whether the film appear in BoxOfficeMojo’s Top 200 list?
Exploratory Data Analysis
We begin by verifying the information from our codebook by viewing the structure of the data then separating the variables by type.
## spc_tbl_ [651 × 32] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ title : chr [1:651] "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
## $ title_type : chr [1:651] "Feature Film" "Feature Film" "Feature Film" "Feature Film" ...
## $ genre : chr [1:651] "Drama" "Drama" "Comedy" "Drama" ...
## $ runtime : num [1:651] 80 101 84 139 90 78 142 93 88 119 ...
## $ mpaa_rating : chr [1:651] "R" "PG-13" "R" "PG" ...
## $ studio : chr [1:651] "Indomina Media Inc." "Warner Bros. Pictures" "Sony Pictures Classics" "Columbia Pictures" ...
## $ thtr_rel_year : num [1:651] 2013 2001 1996 1993 2004 ...
## $ thtr_rel_month : num [1:651] 4 3 8 10 9 1 1 11 9 3 ...
## $ thtr_rel_day : num [1:651] 19 14 21 1 10 15 1 8 7 2 ...
## $ dvd_rel_year : num [1:651] 2013 2001 2001 2001 2005 ...
## $ dvd_rel_month : num [1:651] 7 8 8 11 4 4 2 3 1 8 ...
## $ dvd_rel_day : num [1:651] 30 28 21 6 19 20 18 2 21 14 ...
## $ imdb_rating : num [1:651] 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
## $ imdb_num_votes : num [1:651] 899 12285 22381 35096 2386 ...
## $ critics_rating : chr [1:651] "Rotten" "Certified Fresh" "Certified Fresh" "Certified Fresh" ...
## $ critics_score : num [1:651] 45 96 91 80 33 91 57 17 90 83 ...
## $ audience_rating : chr [1:651] "Upright" "Upright" "Upright" "Upright" ...
## $ audience_score : num [1:651] 73 81 91 76 27 86 76 47 89 66 ...
## $ best_pic_nom : chr [1:651] "no" "no" "no" "no" ...
## $ best_pic_win : chr [1:651] "no" "no" "no" "no" ...
## $ best_actor_win : chr [1:651] "no" "no" "no" "yes" ...
## $ best_actress_win: chr [1:651] "no" "no" "no" "no" ...
## $ best_dir_win : chr [1:651] "no" "no" "no" "yes" ...
## $ top200_box : chr [1:651] "no" "no" "no" "no" ...
## $ director : chr [1:651] "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
## $ actor1 : chr [1:651] "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
## $ actor2 : chr [1:651] "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
## $ actor3 : chr [1:651] "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
## $ actor4 : chr [1:651] "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
## $ actor5 : chr [1:651] "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
## $ imdb_url : chr [1:651] "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
## $ rt_url : chr [1:651] "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...
## - attr(*, "spec")=
## .. cols(
## .. title = col_character(),
## .. title_type = col_character(),
## .. genre = col_character(),
## .. runtime = col_double(),
## .. mpaa_rating = col_character(),
## .. studio = col_character(),
## .. thtr_rel_year = col_double(),
## .. thtr_rel_month = col_double(),
## .. thtr_rel_day = col_double(),
## .. dvd_rel_year = col_double(),
## .. dvd_rel_month = col_double(),
## .. dvd_rel_day = col_double(),
## .. imdb_rating = col_double(),
## .. imdb_num_votes = col_double(),
## .. critics_rating = col_character(),
## .. critics_score = col_double(),
## .. audience_rating = col_character(),
## .. audience_score = col_double(),
## .. best_pic_nom = col_character(),
## .. best_pic_win = col_character(),
## .. best_actor_win = col_character(),
## .. best_actress_win = col_character(),
## .. best_dir_win = col_character(),
## .. top200_box = col_character(),
## .. director = col_character(),
## .. actor1 = col_character(),
## .. actor2 = col_character(),
## .. actor3 = col_character(),
## .. actor4 = col_character(),
## .. actor5 = col_character(),
## .. imdb_url = col_character(),
## .. rt_url = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
Character (no categories): title
director
actor1
actor2
actor3
actor4
actor5
imdb_url
rt_url
Categorical (factor): title_type
(nominal) genre
(nominal) mpaa_rating
(nominal) studio
(nominal) thtr_rel_year
(nominal) thtr_rel_month
(nominal)
thtr_rel_day
(nominal) dvd_rel_year
(nominal)
dvd_rel_month
(nominal) dvd_rel_day
(nominal)
critics_rating
(ordinal) audience_rating
(ordinal)
Binary: best_pic_nom
best_pic_win
best_actor_win
best_actress
best_dir_win
top200_box
Numeric: runtime
imdb_rating
imdb_num_votes
critics_score
audience_score
There are a number of changes we need to make to the data types to not only match the codebook, but to make the data more applicable for the model to be created.
Character (no categories) Variables
We start by addressing the character variables that represent no
categories. We know that title
, imdb_url
, and
rt_url
should be unique to every row so we can get rid of
them first.
Next, if we want any chance at including director
or
actor
into our model, we need to see how many unique values
there are first.
# Select all actor columns, unlist df to vector, count distinct actors
movies %>% select(actor1, actor2, actor3, actor4, actor5) %>% unlist() %>% n_distinct()
## [1] 2363
With 2,363 different actors in the data, we aren’t even going to attempt to classify actors, and choose to exclude them.
## [1] 533
And with 533 different directors in the data we are going to exclude this variable.
For the initial changes, we create a new data frame to leave the original dataset unchanged.
# New df without actors, directors, & URLs
movies_df <- select(movies, -actor1, -actor2, -actor3, -actor4, -actor5, -director, -imdb_url, -rt_url)
Categorical Variables
Nominal
The first variable of this type to examine is studio
to
see if the number of unique studios will succumb it to the same fate of
our Character Variables above, or if we can reduce the number of factor
dimensions to something more appropriate for the model.
## [1] 212
With 212 unique studios in the data we have the choice of leaving
studio
as a string that we can not include in the model, or
we can attempt to bin the studios together to reduce the dimensionality
that will be added to the model from incorporating it as a nominal
factor.
Let’s look at the top 30.
## # A tibble: 30 × 2
## studio n
## <chr> <int>
## 1 Paramount Pictures 37
## 2 Warner Bros. Pictures 30
## 3 Sony Pictures Home Entertainment 27
## 4 Universal Pictures 23
## 5 Warner Home Video 19
## 6 20th Century Fox 18
## 7 Miramax Films 18
## 8 MGM 16
## 9 Twentieth Century Fox Home Entertainment 14
## 10 IFC Films 13
## # ℹ 20 more rows
First we’ll try to identify subsidiaries to all of the
studio
parent organizations.
# Find string pattern in studio then show distinct
movies_df %>%
filter(grepl("paramount", studio, ignore.case = TRUE)) %>% distinct(studio)
## # A tibble: 6 × 1
## studio
## <chr>
## 1 Paramount Home Video
## 2 Paramount Pictures
## 3 Paramount Classics
## 4 Paramount
## 5 Paramount Vantage
## 6 Paramount Studios
## # A tibble: 4 × 1
## studio
## <chr>
## 1 Universal Pictures
## 2 MCA Universal Home Video
## 3 Universal Studios
## 4 Universal
## # A tibble: 8 × 1
## studio
## <chr>
## 1 Warner Bros. Pictures
## 2 Warner Home Video
## 3 WARNER BROTHERS PICTURES
## 4 Warner Bros.
## 5 Warner Bros Pictures
## 6 Warner Independent Pictures
## 7 Warner Independent
## 8 Warners Bros. Pictures
## # A tibble: 7 × 1
## studio
## <chr>
## 1 Sony Pictures Classics
## 2 Sony Pictures Home Entertainment
## 3 Sony Pictures Entertainment
## 4 Sony Pictures
## 5 Sony Pictures/Columbia
## 6 Sony Pictures/Screen Gems
## 7 Sony Entertainment
## # A tibble: 8 × 1
## studio
## <chr>
## 1 20th Century Fox
## 2 20th Century Fox Film Corporat
## 3 Twentieth Century Fox Home Entertainment
## 4 Fox Searchlight Pictures
## 5 Fox Atomic
## 6 20th Century Fox Film Corporation
## 7 Fox Searchlight
## 8 Fox
## # A tibble: 4 × 1
## studio
## <chr>
## 1 MGM/United Artists
## 2 MGM
## 3 MGM Home Entertainment
## 4 MGM/UA
## # A tibble: 2 × 1
## studio
## <chr>
## 1 Miramax Films
## 2 Miramax
## # A tibble: 8 × 1
## studio
## <chr>
## 1 Buena Vista Distribution Compa
## 2 Walt Disney Home Entertainment
## 3 Walt Disney Pictures
## 4 Buena Vista Pictures
## 5 Buena Vista Internationa
## 6 Walt Disney Productions
## 7 Disney
## 8 Buena Vista
Now we will group these together with their respective parent company.
# Find patterns for studio values and replace with parent studio
movies_df <- movies_df %>%
mutate(studio = case_when(grepl("Paramount", studio, ignore.case = TRUE) ~ "Paramount",
grepl("Universal", studio, ignore.case = TRUE) ~ "Universal",
grepl("Fox", studio, ignore.case = TRUE) ~ "Fox",
grepl("Warner", studio, ignore.case = TRUE) ~ "Warner",
grepl("Sony", studio, ignore.case = TRUE) ~ "Sony",
grepl("MGM", studio, ignore.case = TRUE) ~ "MGM",
grepl("Miramax", studio, ignore.case = TRUE) ~ "Miramax",
grepl("Disney|Buena Vista", studio, ignore.case = TRUE) ~ "Disney",
TRUE ~ studio)
)
movies_df %>% count(studio) %>% arrange(desc(n)) %>% .[1:30,]
## # A tibble: 30 × 2
## studio n
## <chr> <int>
## 1 Warner 71
## 2 Paramount 60
## 3 Sony 55
## 4 Fox 51
## 5 Universal 39
## 6 MGM 27
## 7 Disney 25
## 8 Miramax 25
## 9 IFC Films 13
## 10 New Line Cinema 10
## # ℹ 20 more rows
Now, with these studios binned together with their parent
organizations, we can address the 8 NA
values for
studio.
## # A tibble: 8 × 24
## title title_type genre runtime mpaa_rating studio thtr_rel_year thtr_rel_month
## <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl>
## 1 Dirt… Documenta… Come… 94 R <NA> 2007 9
## 2 Cave… Feature F… Acti… 91 PG <NA> 1981 4
## 3 Atta… TV Movie Other 90 R <NA> 1993 12
## 4 Oliv… Feature F… Anim… 74 G <NA> 1988 11
## 5 The … Feature F… Drama 97 R <NA> 2001 10
## 6 Inse… Feature F… Drama 117 NC-17 <NA> 1975 1
## 7 Inbr… Feature F… Art … 90 R <NA> 2011 8
## 8 Deat… Feature F… Horr… 87 R <NA> 1972 1
## # ℹ 16 more variables: thtr_rel_day <dbl>, dvd_rel_year <dbl>,
## # dvd_rel_month <dbl>, dvd_rel_day <dbl>, imdb_rating <dbl>,
## # imdb_num_votes <dbl>, critics_rating <chr>, critics_score <dbl>,
## # audience_rating <chr>, audience_score <dbl>, best_pic_nom <chr>,
## # best_pic_win <chr>, best_actor_win <chr>, best_actress_win <chr>,
## # best_dir_win <chr>, top200_box <chr>
With a quick internet search for the 8 titles, we’ve replaced the
NA
values.
# Insert studio values found on web search
movies_df <- mutate(movies_df,
studio = case_when(title == "Dirty Sanchez: The Movie" ~ "MTV",
title == "Caveman" ~ "Turman-Foster Company",
title == "Attack of the 50 Foot Woman" ~ "Woolner Brothers",
title == "Oliver & Company" ~ "Disney",
title == "The Man Who Sued God" ~ "Disney",
title == "Inserts" ~ "United Artists",
title == "Inbred" ~ "New Flesh Films",
title == "Death Line (Raw Meat)" ~ "American International Pictures",
TRUE ~ studio
)
)
One last look at the top 20.
## # A tibble: 20 × 2
## studio n
## <chr> <int>
## 1 Warner 71
## 2 Paramount 60
## 3 Sony 55
## 4 Fox 51
## 5 Universal 39
## 6 Disney 27
## 7 MGM 27
## 8 Miramax 25
## 9 IFC Films 13
## 10 New Line Cinema 10
## 11 Magnolia Pictures 9
## 12 HBO Video 8
## 13 Columbia Pictures 7
## 14 Lions Gate Films 7
## 15 New Line Home Entertainment 7
## 16 The Weinstein Company 7
## 17 United Artists 7
## 18 First Run Features 6
## 19 Orion Home Video 5
## 20 Focus Features 4
We will put the top 8 into a vector named major_studios
and then change all other studio
values to “Other”.
# Put top 8 studios into a character vector
major_studios <- movies_df %>% count(studio) %>% arrange(desc(n)) %>% .[1:8,1] %>% pull()
# Change studio value to "Other" if not in the top 8 studios
movies_df <- mutate(movies_df, studio = if_else(grepl(paste0(major_studios, collapse="|"),
studio, ignore.case = TRUE), studio, "Other"))
We can consider the studio
variable ready to change to a
nominal factor with 9 levels into the model.
The next categorical variables we want to assess for categorical data
typing is thtr_rel_year
and dvd_rel_year.
This
is a question of whether there is a linear relationship between
thtr_rel_year
/dvd_rel_year
and
imdb_rating
(the response variable). If we find a
relationship over many years that imdb_rating
increases/decreases linearly, then we would change our data type to
numeric. On the other hand if we find that not to be the case (which we
suspect it won’t), then these yearly variables could be considered for
nominal categories by each year, or levels with multiple year spans.
First, we check for any NA
values present in these 6
time variables. To get more clarity on the rows with NA
values, we add title and studio from the original data.
movies %>% select(title, studio, thtr_rel_year, thtr_rel_month, thtr_rel_day,
dvd_rel_year, dvd_rel_month, dvd_rel_day) %>%
filter(rowSums(is.na(select(., -studio))) > 0)
## # A tibble: 8 × 8
## title studio thtr_rel_year thtr_rel_month thtr_rel_day dvd_rel_year
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Charlie: The Li… Warne… 2004 2 13 NA
## 2 Streets of Gold Live … 1986 11 14 NA
## 3 The Squeeze HBO V… 1987 7 10 NA
## 4 Electric Dreams MGM 1984 7 20 NA
## 5 Porky's Revenge 20th … 1985 3 22 NA
## 6 Teen Wolf Too Param… 1987 11 20 NA
## 7 The Last Remake… MCA U… 1977 7 15 NA
## 8 Let It Be Unite… 1970 5 20 NA
## # ℹ 2 more variables: dvd_rel_month <dbl>, dvd_rel_day <dbl>
These 8 titles appear to be TV productions or distributed by Home
Video studios. We are going to use our discretion to copy the time
values from the thtr_rel_
variables to the
dvd_rel_
variables for these 8 specific titles.
# The titles with DVD release times to be updated
titles <- c("Charlie: The Life and Art of Charles Chaplin",
"Streets of Gold",
"The Squeeze",
"Electric Dreams",
"Porky's Revenge",
"Teen Wolf Too",
"The Last Remake of Beau Geste",
"Let It Be")
# Replace dvd_rel values with thtr_rel values for titles in titles vector
movies_df <- movies_df %>%
mutate(dvd_rel_year = if_else(title %in% titles, thtr_rel_year, dvd_rel_year),
dvd_rel_month = if_else(title %in% titles, thtr_rel_month, dvd_rel_month),
dvd_rel_day = if_else(title %in% titles, thtr_rel_day, dvd_rel_day))
Verify that there are no more NA values in these columns.
# Select 8 time columns and filter for any rows with an NA
movies_df %>% select(thtr_rel_year, thtr_rel_month, thtr_rel_day,
dvd_rel_year, dvd_rel_month, dvd_rel_day) %>%
filter(rowSums(is.na(.)) > 0)
## # A tibble: 0 × 6
## # ℹ 6 variables: thtr_rel_year <dbl>, thtr_rel_month <dbl>, thtr_rel_day <dbl>,
## # dvd_rel_year <dbl>, dvd_rel_month <dbl>, dvd_rel_day <dbl>
Assessing linear associations for time variables.
movies_df %>% ggplot(aes(x = thtr_rel_year, y = imdb_rating)) +
geom_point() +
geom_smooth(method = "lm", color = "red") +
theme_bw()
movies_df %>% ggplot(aes(x = dvd_rel_year, y = imdb_rating)) +
geom_point() +
geom_smooth(method = "lm", color = "red") +
theme_bw()
We see no linear relationship from year values and the response variable, therefore if we wish to leave these variables in the model, they must be typed to nominal factors. Note that ordinal factors should be typed to numeric variables, not factors. We also see higher variations in lower scores indicating that our response variable is skewed to the left.
We will now perform the same test for months and days.
imdb_rating ~ thtr_rel_month
movies_df %>% ggplot(aes(x = thtr_rel_month, y = imdb_rating)) +
geom_point() +
geom_smooth(method = "lm", color = "red") +
scale_x_continuous(breaks = 1:12, labels = month.abb) +
theme_bw()
imdb_rating ~ dvd_rel_month
movies_df %>% ggplot(aes(x = dvd_rel_month, y = imdb_rating)) +
geom_point() +
geom_smooth(method = "lm", color = "red") +
scale_x_continuous(breaks = 1:12, labels = month.abb) +
theme_bw()
imdb_rating ~ thtr_rel_day
movies_df %>% ggplot(aes(x = thtr_rel_day, y = imdb_rating)) +
geom_point() +
geom_smooth(method = "lm", color = "red") +
scale_x_continuous(breaks = 1:31) +
theme_bw()
imdb_rating ~ dvd_rel_day
movies_df %>% ggplot(aes(x = dvd_rel_day, y = imdb_rating)) +
geom_point() +
geom_smooth(method = "lm", color = "red") +
scale_x_continuous(breaks = 1:31) +
theme_bw()
We conclude that all 6 of these time variables must be nominal factors and not numeric variables, if we wish to include them in the model.
We wish to convert the yearly variables into decade categories to reduce dimensionality.
movies_df <- movies_df %>%
mutate(thtr_rel_decade = paste0((thtr_rel_year %/% 10) * 10, "'s"),
.before = thtr_rel_year) %>%
mutate(dvd_rel_decade = paste0((dvd_rel_year %/% 10) * 10, "'s"),
.before = dvd_rel_year) %>%
select(-thtr_rel_year, -dvd_rel_year)
Now we’ll convert all 6 to factors.
# Mutate variables with as.factor
movies_df <- movies_df %>% mutate(thtr_rel_decade = as.factor(thtr_rel_decade),
thtr_rel_month = as.factor(thtr_rel_month),
thtr_rel_day = as.factor(thtr_rel_day),
dvd_rel_decade = as.factor(dvd_rel_decade),
dvd_rel_month = as.factor(dvd_rel_month),
dvd_rel_day = as.factor(dvd_rel_day)
)
The next group of variables: title_type
,
genre
, and mpaa_rating
we are going to type as
nominal categories by using a summary table.
This custom function, unique_with_counts()
will return a
summary table that shows us the counts for each distinct value,
including NA
values.
unique_with_counts <- function(data) {
# Getting the unique values and counts for each column
list_columns <- lapply(data, function(col) {
tbl <- table(col, useNA = "ifany")
tibble(value = as.vector(names(tbl)), count = as.vector(tbl))
})
# Getting the maximum number of unique values (including NA) in any column
max_length <- max(sapply(list_columns, nrow))
# Making all tibbles the same length (with NA for missing)
list_columns <- lapply(list_columns, function(tbl) {
add_rows <- max_length - nrow(tbl)
if (add_rows > 0) {
tbl <- rbind(tbl, tibble(value = rep(NA, add_rows), count = rep(NA, add_rows)))
}
arrange(tbl, desc(tbl[,2]))
})
# Binding all the tibbles column-wise
output <- do.call(cbind, list_columns)
# Creating new column names
new_col_names <- c(rbind(paste0(names(data), "_value"), paste0(names(data), "_count")))
colnames(output) <- as.vector(new_col_names)
# Returning the final data frame
return(output)
}
## title_type_value title_type_count genre_value genre_count
## 1 Feature Film 591 Drama 305
## 2 Documentary 55 Comedy 87
## 3 TV Movie 5 Action & Adventure 65
## 4 <NA> NA Mystery & Suspense 59
## 5 <NA> NA Documentary 52
## 6 <NA> NA Horror 23
## 7 <NA> NA Other 16
## 8 <NA> NA Art House & International 14
## 9 <NA> NA Musical & Performing Arts 12
## 10 <NA> NA Animation 9
## 11 <NA> NA Science Fiction & Fantasy 9
## mpaa_rating_value mpaa_rating_count
## 1 R 329
## 2 PG-13 133
## 3 PG 118
## 4 Unrated 50
## 5 G 19
## 6 NC-17 2
## 7 <NA> NA
## 8 <NA> NA
## 9 <NA> NA
## 10 <NA> NA
## 11 <NA> NA
We can see there are no NA
values present here.
title_type
looks good, genre
doesn’t appear to
have any similar or repeating levels, and mpaa_rating
has 2
“NC-17” observations which we will change to “R”.
# Change "NC-17" values to "R"
movies_df <- movies_df %>% mutate(mpaa_rating = if_else(mpaa_rating == "NC-17", "R", mpaa_rating))
# Mutate variables with as.factor
movies_df <- movies_df %>% mutate(title_type = as.factor(title_type),
genre = as.factor(genre),
mpaa_rating = as.factor(mpaa_rating)
)
Ordinal Categories
Our final 2 categorical variables represent ordinal categories.
Let’s view their summary.
## critics_rating_value critics_rating_count audience_rating_value
## 1 Rotten 307 Upright
## 2 Fresh 209 Spilled
## 3 Certified Fresh 135 <NA>
## audience_rating_count
## 1 376
## 2 275
## 3 NA
These look ready to convert with ordinary/integer encoding.
Let’s look at how critics_rating
is derived.
Icon | Score | Description |
---|---|---|
100-75% | Certified Fresh: Wide-release films with a score of 75% or higher that are reviewed by at least 80 critics, of whom 5 are “Top Critics”, are given this seal. The “Certified Fresh” seal remains until the score drops below 70%. Films with limited releases require only 40 reviews (including 5 from “Top Critics”) to qualify for this seal. For TV shows, only individual seasons are eligible for consideration, and each must have at least 20 critic reviews. | |
100-60% | Fresh: Films or TV shows with a score of 60% or higher that do not meet the requirements for the “Certified Fresh” seal. | |
59-0% | Rotten: Films or TV shows with a score of 59% or lower receive this seal. |
Since the 3 levels of critics_rating
are not equidistant
and are weighted by types of critics it would not be appropriate to
convert these to numerical values with ordinary/integer encoding.
Instead we want to leave them as a ordinal categorical factors in order
for them each to get their own reference level in the model. Since I was
not able to find information on how audience_rating
was
derived (other than it being calculated from
audience_score
) it too will be converted to an ordinal
factor.
movies_df <- movies_df %>%
mutate(critics_rating = factor(critics_rating, levels = c("Rotten", "Fresh", "Certified Fresh")),
audience_rating = as.factor(audience_rating)
)
Binary Variables
We use our summary table to show the 6 “yes/no” value variables.
select(movies_df, best_pic_nom, best_pic_win, best_actor_win, best_actress_win, best_dir_win, top200_box) %>%
unique_with_counts()
## best_pic_nom_value best_pic_nom_count best_pic_win_value best_pic_win_count
## 1 no 629 no 644
## 2 yes 22 yes 7
## best_actor_win_value best_actor_win_count best_actress_win_value
## 1 no 558 no
## 2 yes 93 yes
## best_actress_win_count best_dir_win_value best_dir_win_count top200_box_value
## 1 579 no 608 no
## 2 72 yes 43 yes
## top200_box_count
## 1 636
## 2 15
We wish to convert from categories to numeric values (0,1) with ordinary/integer encoding.
movies_df <- movies_df %>%
mutate(best_pic_nom = if_else(best_pic_nom == "yes", 1, 0),
best_pic_win = if_else(best_pic_win == "yes", 1, 0),
best_actor_win = if_else(best_actor_win == "yes", 1, 0),
best_actress_win = if_else(best_actress_win == "yes", 1, 0),
best_dir_win = if_else(best_dir_win == "yes", 1, 0),
top200_box = if_else(top200_box == "yes", 1, 0)
)
Numeric Variables
Since our summary table is not practical for numeric variables, we
will filter for NA
values manually.
# Filter any rows with an NA numeric column
movies_df %>% select(title, runtime, imdb_rating, imdb_num_votes, critics_score, audience_score) %>%
filter(rowSums(is.na(.)) > 0)
## # A tibble: 1 × 6
## title runtime imdb_rating imdb_num_votes critics_score audience_score
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 The End of Am… NA 7.5 739 80 72
The runtime of 74 minutes for the movie, “The End of America”, can be found here.
We will view the distributoin of each numeric variable with a historgram.
runtime
imdb_rating
imdb_num_votes
critics_score
audience_score
ggplot(movies_df, aes(x = runtime)) +
geom_histogram(binwidth = 1,
color = "red", alpha = 0.7) +
labs(title = paste("Histogram of runtime")) +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
Let’s look closer at the outliers on both sides.
# runtime outliers in original data to retrieve rt_url again
movies %>% select(title, runtime, rt_url) %>% filter(runtime < 60 | runtime > 180)
## # A tibble: 5 × 3
## title runtime rt_url
## <chr> <dbl> <chr>
## 1 Africa: The Serengeti 39 //www.rottentomato…
## 2 Hotel Terminus: The Life and Times of Klaus Barbie 267 //www.rottentomato…
## 3 The Godfather, Part II 202 //www.rottentomato…
## 4 Sea Monsters: A Prehistoric Adventure 40 //www.rottentomato…
## 5 Titanic 194 //www.rottentomato…
All of these runtime
values are accurate.
ggplot(movies_df, aes(x = imdb_rating)) +
geom_histogram(binwidth = 0.1, fill = "red", color = "black", alpha = 0.7,
bins=length(unique(movies_df$imdb_rating))) +
labs(title = "Histogram of imdb_rating") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
Let’s verify the outliers again.
# imdb_rating outliers in original data to retrieve imdb_url again
movies %>% filter(imdb_rating < 4 | imdb_rating > 8.8) %>% select(title, imdb_rating, imdb_url) %>%
arrange(imdb_rating)
## # A tibble: 21 × 3
## title imdb_rating imdb_url
## <chr> <dbl> <chr>
## 1 Disaster Movie 1.9 http://www.imdb.com/titl…
## 2 Epic Movie 2.3 http://www.imdb.com/titl…
## 3 Battlefield Earth 2.4 http://www.imdb.com/titl…
## 4 Viva Knievel! 2.7 http://www.imdb.com/titl…
## 5 Doogal 2.8 http://www.imdb.com/titl…
## 6 Teen Wolf Too 3.1 http://www.imdb.com/titl…
## 7 Jack and Jill 3.4 http://www.imdb.com/titl…
## 8 Carnosaur 3.4 http://www.imdb.com/titl…
## 9 Dumb and Dumberer: When Harry Met Lloyd 3.4 http://www.imdb.com/titl…
## 10 Galaxina 3.4 http://www.imdb.com/titl…
## # ℹ 11 more rows
While not exact, due to the fluidity of this variable in real time and when this data was collected, but they are all in very comparable ranges.
We wish to remove the ratings lower than 4.0 from the data to lower the skew.
ggplot(movies_df, aes(x = imdb_num_votes)) +
geom_histogram(binwidth = 100, fill = "red", color = "red", alpha = 0.7) +
labs(title = "Histogram of imdb_num_votes") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
This amount of skew requires a closer look.
ggplot(movies_df %>% filter(imdb_num_votes < 50000), aes(x = imdb_num_votes)) +
geom_histogram(binwidth = 100, fill = "red", alpha = 0.7) +
labs(title = "Histogram of imdb_num_votes") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
## # A tibble: 10 × 24
## title title_type genre runtime mpaa_rating studio thtr_rel_decade
## <chr> <fct> <fct> <dbl> <fct> <fct> <fct>
## 1 Django Unchained Feature F… Other 165 R Other 2010's
## 2 The Hangover Feature F… Come… 100 R Warner 2000's
## 3 V for Vendetta Feature F… Drama 132 R Warner 2000's
## 4 The Godfather, P… Feature F… Myst… 202 R Param… 1970's
## 5 Memento Feature F… Myst… 113 R Other 2000's
## 6 Good Will Hunting Feature F… Drama 126 R Miram… 1990's
## 7 The Prestige Feature F… Drama 130 PG-13 Disney 2000's
## 8 No Country for O… Feature F… Drama 122 R Miram… 2000's
## 9 Titanic Feature F… Drama 194 PG-13 Param… 1990's
## 10 The Hunger Games Feature F… Drama 142 PG-13 Other 2010's
## # ℹ 17 more variables: thtr_rel_month <fct>, thtr_rel_day <fct>,
## # dvd_rel_decade <fct>, dvd_rel_month <fct>, dvd_rel_day <fct>,
## # imdb_rating <dbl>, imdb_num_votes <dbl>, critics_rating <fct>,
## # critics_score <dbl>, audience_rating <fct>, audience_score <dbl>,
## # best_pic_nom <dbl>, best_pic_win <dbl>, best_actor_win <dbl>,
## # best_actress_win <dbl>, best_dir_win <dbl>, top200_box <dbl>
We can verify this skew by these popular films and apply a few log
transformations to imdb_num_votes
to reduce the skew.
ggplot(movies_df, aes(x = log(imdb_num_votes))) +
geom_histogram(binwidth = .1, fill = "red", color = "black", alpha = 0.7) +
labs(title = "Histogram of imdb_num_votes") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
ggplot(movies_df, aes(x = log2(imdb_num_votes))) +
geom_histogram(binwidth = .2, fill = "red", color = "black", alpha = 0.7) +
labs(title = "Histogram of imdb_num_votes") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
ggplot(movies_df, aes(x = log10(imdb_num_votes))) +
geom_histogram(binwidth = .05, fill = "red", color = "black", alpha = 0.7) +
labs(title = "Histogram of imdb_num_votes") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
The base log transformation appears to depict the most normal distribution.
ggplot(movies_df, aes(x = critics_score)) +
geom_histogram(binwidth = 1, fill = "red", color = "black", alpha = 0.7) +
labs(title = "Histogram of critics_score") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
critics_score
is not skewed.
ggplot(movies_df, aes(x = audience_score)) +
geom_histogram(binwidth = 1, fill = "red", color = "black", alpha = 0.7) +
labs(title = "Histogram of audience_score") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
The skew in audience_score
is bearable for our
purposes.
Final Variable Assessment
Are there any rows with an NA value left?
## # A tibble: 0 × 24
## # ℹ 24 variables: title <chr>, title_type <fct>, genre <fct>, runtime <dbl>,
## # mpaa_rating <fct>, studio <fct>, thtr_rel_decade <fct>,
## # thtr_rel_month <fct>, thtr_rel_day <fct>, dvd_rel_decade <fct>,
## # dvd_rel_month <fct>, dvd_rel_day <fct>, imdb_rating <dbl>,
## # imdb_num_votes <dbl>, critics_rating <fct>, critics_score <dbl>,
## # audience_rating <fct>, audience_score <dbl>, best_pic_nom <dbl>,
## # best_pic_win <dbl>, best_actor_win <dbl>, best_actress_win <dbl>, …
Are there any duplicated rows?
## # A tibble: 1 × 24
## title title_type genre runtime mpaa_rating studio thtr_rel_decade
## <chr> <fct> <fct> <dbl> <fct> <fct> <fct>
## 1 Man on Wire Documentary Documentary 94 PG-13 Other 2000's
## # ℹ 17 more variables: thtr_rel_month <fct>, thtr_rel_day <fct>,
## # dvd_rel_decade <fct>, dvd_rel_month <fct>, dvd_rel_day <fct>,
## # imdb_rating <dbl>, imdb_num_votes <dbl>, critics_rating <fct>,
## # critics_score <dbl>, audience_rating <fct>, audience_score <dbl>,
## # best_pic_nom <dbl>, best_pic_win <dbl>, best_actor_win <dbl>,
## # best_actress_win <dbl>, best_dir_win <dbl>, top200_box <dbl>
Drop the 1 duplicated row.
Final review of all variable types for the model.
## tibble [630 × 24] (S3: tbl_df/tbl/data.frame)
## $ title : chr [1:630] "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
## $ title_type : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
## $ genre : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
## $ runtime : num [1:630] 80 101 84 139 90 78 142 93 88 119 ...
## $ mpaa_rating : Factor w/ 5 levels "G","PG","PG-13",..: 4 3 4 2 4 5 3 4 5 5 ...
## $ studio : Factor w/ 9 levels "Disney","Fox",..: 5 9 7 5 5 5 6 3 5 5 ...
## $ thtr_rel_decade : Factor w/ 5 levels "1970's","1980's",..: 5 4 3 3 4 4 2 3 5 5 ...
## $ thtr_rel_month : Factor w/ 12 levels "1","2","3","4",..: 4 3 8 10 9 1 1 11 9 3 ...
## $ thtr_rel_day : Factor w/ 31 levels "1","2","3","4",..: 19 14 21 1 10 15 1 8 7 2 ...
## $ dvd_rel_decade : Factor w/ 5 levels "1970's","1980's",..: 5 4 4 4 4 5 4 4 5 5 ...
## $ dvd_rel_month : Factor w/ 12 levels "1","2","3","4",..: 7 8 8 11 4 4 2 3 1 8 ...
## $ dvd_rel_day : Factor w/ 31 levels "1","2","3","4",..: 30 28 21 6 19 20 18 2 21 14 ...
## $ imdb_rating : num [1:630] 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
## $ imdb_num_votes : num [1:630] 6.8 9.42 10.02 10.47 7.78 ...
## $ critics_rating : Factor w/ 3 levels "Rotten","Fresh",..: 1 3 3 3 1 2 1 1 2 3 ...
## $ critics_score : num [1:630] 45 96 91 80 33 91 57 17 90 83 ...
## $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
## $ audience_score : num [1:630] 73 81 91 76 27 86 76 47 89 66 ...
## $ best_pic_nom : num [1:630] 0 0 0 0 0 0 0 0 0 0 ...
## $ best_pic_win : num [1:630] 0 0 0 0 0 0 0 0 0 0 ...
## $ best_actor_win : num [1:630] 0 0 0 1 0 0 0 1 0 0 ...
## $ best_actress_win: num [1:630] 0 0 0 0 0 0 0 0 0 0 ...
## $ best_dir_win : num [1:630] 0 0 0 1 0 0 0 0 0 0 ...
## $ top200_box : num [1:630] 0 0 0 0 0 0 0 0 0 0 ...
This concludes our EDA and Data Wrangling of all variables in the data.
Multicollinearity of Independent Variables & Linear Relations to Y
We’re going to start with 2 custom functions to create a correlation heat map for all variables (numeric or factor).
calculate_association
takes any 2 variables and performs
a series of else if
statements to calculate
association/correlation.
If both variables are identical, returns 1.
If both variables are numeric, returns Pearson correlation.
If variables are 1 factor & 1 numeric,
3.1. If factor = 2 levels, returns point-biserial correlation.
3.2. If factor > 2 levels, returns eta-squared of ANOVA.
If both variables are factors (any levels), returns Cramer’s V correlation.
calculate_association <- function(x, y) {
if (identical(x, y)) { # TEST 1
return(c("identity" = 1))
} else if (is.numeric(x) && is.numeric(y)) { # TEST 2
return(c("pearson" = cor(x, y, use = "pairwise.complete.obs")))
} else if ((is.numeric(x) && is.factor(y)) || (is.factor(x) && is.numeric(y))) { # TEST 3
# Swap if necessary to ensure x is factor and y is numeric
if (is.numeric(x)) {
temp <- x
x <- y
y <- temp
}
if (length(levels(x)) == 2) { # TEST 3.1
# Point-Biserial Correlation for binary factor
return(c("point-biserial" = cor(y, as.integer(x) - 1)))
} else { # TEST 3.2
# Use eta-squared for factors with more than two levels
anova_result <- summary(aov(y ~ x))
sum_sq <- anova_result[[1]][["Sum Sq"]]
eta_squared <- sum_sq[1] / sum(sum_sq)
return(c("eta-squared" = eta_squared))
}
} else { # TEST 4
# Assuming both variables are factors
tbl <- table(x, y)
chi_squared <- chisq.test(tbl)$statistic
n <- sum(tbl)
k <- ncol(tbl)
r <- nrow(tbl)
cramers_v <- sqrt((chi_squared / n) / min(k - 1, r - 1))
return(c("cramer's-v" = cramers_v))
}
}
custom_corr_plot
takes a data frame and iterates each
pair of variables through the calculate_association
function and then melts the matrix to a long format suitable for
ggplot
, and plotted with geom_tile
. Text
labels of correlation values are removed if = 1, black if > 0.5 or
< -0.5, else gray.
custom_corr_plot <- function(data) {
var_names <- names(data)
n <- length(var_names)
assoc_matrix <- matrix(NA, ncol=n, nrow=n)
for (i in 1:n) {
for (j in 1:n) {
assoc_matrix[i, j] <- calculate_association(data[[var_names[i]]], data[[var_names[j]]])
}
}
colnames(assoc_matrix) <- var_names
rownames(assoc_matrix) <- var_names
# Melt the matrix
melted <- reshape2::melt(assoc_matrix)
# Plot the heatmap
p <- ggplot(data = melted, aes(x = Var2, y = Var1, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Association") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 10, hjust = 1),
axis.text.y = element_text(size = 10),
axis.title = element_blank()) +
coord_fixed()
# Add text labels for associations >= 0.5 or <= -0.5, but not equal to 1, in black
p <- p + geom_text(data = subset(melted, (value >= 0.5 | value <= -0.5) & value != 1),
aes(x = Var2, y = Var1, label = round(value, 2)),
size = 4, color = "black")
# Add text labels for associations < 0.5 and > -0.5 and not equal to 1, in gray
p <- p + geom_text(data = subset(melted, value < 0.5 & value > -0.5 & value != 1),
aes(x = Var2, y = Var1, label = round(value, 2)),
size = 4, color = "darkgray")
return(p)
}
The custom correlation plot reveals the release date variables have the lowest association with the response variable, followed by studio.
There is also very high coliniarity between
(critics_rating
& audience_rating
),
(critics_score
& critics_rating
), as well
as (audience_score
& critics_score
).
We know the audience_rating
and
critics_rating
are derived from audience_score
and critics_score
respectively. And that
audience_rating
is a 2 level factor, while
critics_score
is a 3 level factor.
Since critics_score
has 3 levels, and we actually have
some clarity on how it is calculated, unlike
audience_rating
, we will omit audience_rating
and critics_score
from the model.
movies_df <- movies_df %>% select(-thtr_rel_decade, -thtr_rel_month, -thtr_rel_day,
-dvd_rel_decade, -dvd_rel_month, -dvd_rel_day,
-audience_rating, -critics_score)
Modeling
We are going to build a Multiple Linear Regression model using the backwardation method to optimize the p-values of the independent variables. At the end we will pick the model with the highest \(R^{2}_{\text{adj}}\).
Dependent Variable
imdb_rating
(numeric: 1.0 - 10.0)
Independent Variables
title_type
(categorical: 3 levels)genre
(categorical: 11 levels)runtime
(numeric)mpaa_rating
(categorical: 5 levels)critics_rating
(categorical: 3 levels)audience_score
(numeric)studio
(categorical: 9 levels)imdb_num_votes
(numeric (log))best_pic_nom
(binary)best_pic_win
(binary)best_actor_win
(binary)best_actress_win
(binary)best_dir_win
(binary)top200_box
(binary)
We are going to split the training and test sets at a 70/30 ratio.
df <- movies_df %>% select(-title)
set.seed(321)
sample <- sample(c(TRUE, FALSE), nrow(df), replace=TRUE, prob=c(0.7,0.3))
train <- df[sample, ]
test <- df[!sample, ]
Full Model
# Full model has 14 independent variables
full_model <- lm(imdb_rating ~ ., data = train)
# Model summary
summary(full_model)
##
## Call:
## lm(formula = imdb_rating ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.58045 -0.19563 0.01808 0.22459 0.96109
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.790920 0.242004 15.665 < 2e-16 ***
## title_typeFeature Film -0.361593 0.150837 -2.397 0.016967 *
## title_typeTV Movie -0.066415 0.302173 -0.220 0.826143
## genreAnimation -0.703019 0.206352 -3.407 0.000722 ***
## genreArt House & International 0.306149 0.150763 2.031 0.042935 *
## genreComedy -0.116980 0.076940 -1.520 0.129185
## genreDocumentary 0.290625 0.161918 1.795 0.073408 .
## genreDrama 0.114183 0.069190 1.650 0.099654 .
## genreHorror 0.114201 0.112601 1.014 0.311081
## genreMusical & Performing Arts 0.133706 0.144280 0.927 0.354623
## genreMystery & Suspense 0.160031 0.085144 1.880 0.060881 .
## genreOther -0.008754 0.136196 -0.064 0.948781
## genreScience Fiction & Fantasy 0.023126 0.153195 0.151 0.880085
## runtime 0.003823 0.001036 3.691 0.000254 ***
## mpaa_ratingPG -0.126689 0.140587 -0.901 0.368042
## mpaa_ratingPG-13 -0.156071 0.147017 -1.062 0.289049
## mpaa_ratingR -0.065670 0.142603 -0.461 0.645399
## mpaa_ratingUnrated -0.112043 0.156357 -0.717 0.474040
## studioFox -0.065947 0.127433 -0.518 0.605083
## studioMGM -0.219163 0.141272 -1.551 0.121591
## studioMiramax 0.047381 0.140632 0.337 0.736355
## studioOther -0.006358 0.116212 -0.055 0.956397
## studioParamount -0.036883 0.125350 -0.294 0.768725
## studioSony 0.049284 0.129577 0.380 0.703886
## studioUniversal 0.013468 0.135414 0.099 0.920823
## studioWarner 0.002171 0.123915 0.018 0.986031
## imdb_num_votes 0.061954 0.015397 4.024 6.82e-05 ***
## critics_ratingFresh 0.275438 0.047344 5.818 1.20e-08 ***
## critics_ratingCertified Fresh 0.308775 0.061666 5.007 8.23e-07 ***
## audience_score 0.031330 0.001305 24.015 < 2e-16 ***
## best_pic_nom 0.023351 0.115398 0.202 0.839746
## best_pic_win -0.015571 0.248181 -0.063 0.950003
## best_actor_win 0.033094 0.053414 0.620 0.535879
## best_actress_win 0.029660 0.058730 0.505 0.613815
## best_dir_win 0.046658 0.074883 0.623 0.533578
## top200_box -0.050425 0.120443 -0.419 0.675685
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3699 on 409 degrees of freedom
## Multiple R-squared: 0.8486, Adjusted R-squared: 0.8356
## F-statistic: 65.49 on 35 and 409 DF, p-value: < 2.2e-16
# Predictions
predictions <- predict(full_model, newdata = test)
# Residuals
residuals <- test$imdb_rating - predictions
# RMSE
rmse <- sqrt(mean(residuals^2))
print(paste("Root Mean Squared Error: ", rmse))
## [1] "Root Mean Squared Error: 0.412579536589289"
# RMSE Percentage from Mean
mean <- mean(train$imdb_rating)
pct_rmse <- (rmse / mean) * 100
# R-Squared
ssr <- sum((test$imdb_rating - predictions)^2)
sst <- sum((test$imdb_rating - mean(test$imdb_rating))^2)
rsq <- 1 - (ssr / sst)
print(paste("R-Squared: ", rsq))
## [1] "R-Squared: 0.817427989671253"
# Adjusted R-Squared
n <- length(test$imdb_rating) # Number of observations
p <- length(full_model$model) - 1 # Number of predictors (excluding the constant term)
adj_rsq <- 1 - (1 - rsq) * (n - 1) / (n - p - 1)
print(paste("Adjusted R-Squared: ", adj_rsq))
## [1] "Adjusted R-Squared: 0.80239264764418"
# Table to store results
results <- tibble("model" = NA, "rmse" = NA, "pct_rmse" = NA, "rsq" = NA, "adj_rsq" = NA)
# Store results
results <- rbind(results, c("Full Model", rmse, pct_rmse, rsq, adj_rsq)) %>% drop_na()
The full model has a RMSE of 0.41, meaning our mean error for
predictions of imdb_rating
are 0.41, the Adjusted R-Squared
for this model was 0.80, meaning 80% of the variations in the data can
be explained by this model.
studio
and mpaa_rating
have no significant
levels in the model and will be removed for the next model.
Model 2
# Model 2 has 12 independent variables
model2 <- lm(imdb_rating ~ . ,
data = train %>% select(-studio, -mpaa_rating))
# Model summary
summary(model2)
##
## Call:
## lm(formula = imdb_rating ~ ., data = train %>% select(-studio,
## -mpaa_rating))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.61115 -0.19415 0.01798 0.23983 0.91816
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.7042453 0.2084548 17.770 < 2e-16 ***
## title_typeFeature Film -0.3757960 0.1478562 -2.542 0.011391 *
## title_typeTV Movie -0.0702280 0.3012984 -0.233 0.815809
## genreAnimation -0.7007946 0.1956723 -3.581 0.000382 ***
## genreArt House & International 0.3328247 0.1478415 2.251 0.024886 *
## genreComedy -0.1268088 0.0751275 -1.688 0.092168 .
## genreDocumentary 0.2723952 0.1584076 1.720 0.086243 .
## genreDrama 0.1202687 0.0665707 1.807 0.071534 .
## genreHorror 0.1302281 0.1095729 1.189 0.235304
## genreMusical & Performing Arts 0.1295539 0.1424317 0.910 0.363561
## genreMystery & Suspense 0.1817804 0.0824206 2.206 0.027957 *
## genreOther -0.0165435 0.1344770 -0.123 0.902149
## genreScience Fiction & Fantasy 0.0453383 0.1519427 0.298 0.765552
## runtime 0.0035995 0.0009956 3.615 0.000336 ***
## imdb_num_votes 0.0608258 0.0145287 4.187 3.45e-05 ***
## critics_ratingFresh 0.2803119 0.0468880 5.978 4.81e-09 ***
## critics_ratingCertified Fresh 0.3304061 0.0595687 5.547 5.15e-08 ***
## audience_score 0.0315804 0.0012922 24.440 < 2e-16 ***
## best_pic_nom 0.0051778 0.1145458 0.045 0.963967
## best_pic_win 0.0094640 0.2454067 0.039 0.969256
## best_actor_win 0.0300258 0.0528444 0.568 0.570207
## best_actress_win 0.0256085 0.0578457 0.443 0.658208
## best_dir_win 0.0582637 0.0740399 0.787 0.431770
## top200_box -0.0620662 0.1156506 -0.537 0.591779
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3705 on 421 degrees of freedom
## Multiple R-squared: 0.8436, Adjusted R-squared: 0.8351
## F-statistic: 98.76 on 23 and 421 DF, p-value: < 2.2e-16
# Predictions
predictions <- predict(model2, newdata = test)
# Residuals
residuals <- test$imdb_rating - predictions
# RMSE
rmse <- sqrt(mean(residuals^2))
print(paste("Root Mean Squared Error: ", rmse))
## [1] "Root Mean Squared Error: 0.416620609888723"
# RMSE Percentage from Mean
mean <- mean(train$imdb_rating)
pct_rmse <- (rmse / mean) * 100
# R-Squared
ssr <- sum((test$imdb_rating - predictions)^2)
sst <- sum((test$imdb_rating - mean(test$imdb_rating))^2)
rsq <- 1 - (ssr / sst)
print(paste("R-Squared: ", rsq))
## [1] "R-Squared: 0.813834015692678"
# Adjusted R-Squared
n <- length(test$imdb_rating) # Number of observations
p <- length(model2$model) - 1 # Number of predictors (excluding the constant term)
adj_rsq <- 1 - (1 - rsq) * (n - 1) / (n - p - 1)
print(paste("Adjusted R-Squared: ", adj_rsq))
## [1] "Adjusted R-Squared: 0.80084569120612"
Since the p-value for top200_box
,
best_pic_nom
, and best_pic_win
are all
extraordinarily high, we remove all 3 on the next iteration.
Model 3
# Model 3 has 9 independent variables
model3 <- lm(imdb_rating ~ . ,
data = train %>% select(-studio, -mpaa_rating,
-top200_box, -best_pic_nom, -best_pic_win))
# Model summary
summary(model3)
##
## Call:
## lm(formula = imdb_rating ~ ., data = train %>% select(-studio,
## -mpaa_rating, -top200_box, -best_pic_nom, -best_pic_win))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.60980 -0.19122 0.01764 0.24198 0.91975
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.7077604 0.2058612 18.011 < 2e-16 ***
## title_typeFeature Film -0.3745175 0.1473637 -2.541 0.011395 *
## title_typeTV Movie -0.0699695 0.3002957 -0.233 0.815872
## genreAnimation -0.6941699 0.1946752 -3.566 0.000404 ***
## genreArt House & International 0.3381596 0.1470126 2.300 0.021921 *
## genreComedy -0.1214962 0.0742509 -1.636 0.102520
## genreDocumentary 0.2781773 0.1575180 1.766 0.078115 .
## genreDrama 0.1263327 0.0653955 1.932 0.054047 .
## genreHorror 0.1362665 0.1086765 1.254 0.210579
## genreMusical & Performing Arts 0.1360975 0.1412795 0.963 0.335936
## genreMystery & Suspense 0.1879618 0.0813948 2.309 0.021409 *
## genreOther -0.0096501 0.1332275 -0.072 0.942291
## genreScience Fiction & Fantasy 0.0438712 0.1514347 0.290 0.772185
## runtime 0.0035830 0.0009803 3.655 0.000289 ***
## imdb_num_votes 0.0598607 0.0143637 4.167 3.73e-05 ***
## critics_ratingFresh 0.2797171 0.0467129 5.988 4.53e-09 ***
## critics_ratingCertified Fresh 0.3272039 0.0582933 5.613 3.60e-08 ***
## audience_score 0.0315893 0.0012821 24.639 < 2e-16 ***
## best_actor_win 0.0292446 0.0525009 0.557 0.577800
## best_actress_win 0.0258559 0.0572403 0.452 0.651710
## best_dir_win 0.0605614 0.0728118 0.832 0.406017
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3693 on 424 degrees of freedom
## Multiple R-squared: 0.8435, Adjusted R-squared: 0.8362
## F-statistic: 114.3 on 20 and 424 DF, p-value: < 2.2e-16
# Predictions
predictions <- predict(model3, newdata = test)
# Residuals
residuals <- test$imdb_rating - predictions
# RMSE
rmse <- sqrt(mean(residuals^2))
print(paste("Root Mean Squared Error: ", rmse))
## [1] "Root Mean Squared Error: 0.417005268418771"
# RMSE Percentage from Mean
mean <- mean(train$imdb_rating)
pct_rmse <- (rmse / mean) * 100
# R-Squared
ssr <- sum((test$imdb_rating - predictions)^2)
sst <- sum((test$imdb_rating - mean(test$imdb_rating))^2)
rsq <- 1 - (ssr / sst)
print(paste("R-Squared: ", rsq))
## [1] "R-Squared: 0.813490089394011"
# Adjusted R-Squared
n <- length(test$imdb_rating) # Number of observations
p <- length(model3$model) - 1 # Number of predictors (excluding the constant term)
adj_rsq <- 1 - (1 - rsq) * (n - 1) / (n - p - 1)
print(paste("Adjusted R-Squared: ", adj_rsq))
## [1] "Adjusted R-Squared: 0.803898151134274"
The last 3 binary variables are dropped for the next iteration.
Model 4
# Model 4 has 6 independent variables
model4 <- lm(imdb_rating ~ . ,
data = train %>% select(-studio, -mpaa_rating,
-top200_box, -best_pic_nom, -best_pic_win,
-best_actor_win, -best_actress_win, -best_dir_win))
# Model summary
summary(model4)
##
## Call:
## lm(formula = imdb_rating ~ ., data = train %>% select(-studio,
## -mpaa_rating, -top200_box, -best_pic_nom, -best_pic_win,
## -best_actor_win, -best_actress_win, -best_dir_win))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.61836 -0.19334 0.02005 0.24308 0.90478
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6650865 0.2020456 18.140 < 2e-16 ***
## title_typeFeature Film -0.3736089 0.1470474 -2.541 0.011415 *
## title_typeTV Movie -0.0691443 0.2987544 -0.231 0.817082
## genreAnimation -0.6938945 0.1942445 -3.572 0.000394 ***
## genreArt House & International 0.3397787 0.1462813 2.323 0.020660 *
## genreComedy -0.1169948 0.0736708 -1.588 0.113009
## genreDocumentary 0.2801549 0.1570645 1.784 0.075184 .
## genreDrama 0.1327046 0.0644934 2.058 0.040231 *
## genreHorror 0.1337597 0.1083805 1.234 0.217819
## genreMusical & Performing Arts 0.1323506 0.1408817 0.939 0.348033
## genreMystery & Suspense 0.1993319 0.0802253 2.485 0.013350 *
## genreOther -0.0045928 0.1327615 -0.035 0.972420
## genreScience Fiction & Fantasy 0.0450725 0.1509859 0.299 0.765450
## runtime 0.0038872 0.0009404 4.133 4.30e-05 ***
## imdb_num_votes 0.0615548 0.0142459 4.321 1.94e-05 ***
## critics_ratingFresh 0.2836416 0.0464545 6.106 2.30e-09 ***
## critics_ratingCertified Fresh 0.3288815 0.0581081 5.660 2.78e-08 ***
## audience_score 0.0315546 0.0012772 24.706 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3686 on 427 degrees of freedom
## Multiple R-squared: 0.843, Adjusted R-squared: 0.8368
## F-statistic: 134.9 on 17 and 427 DF, p-value: < 2.2e-16
# Predictions
predictions <- predict(model4, newdata = test)
# Residuals
residuals <- test$imdb_rating - predictions
# RMSE
rmse <- sqrt(mean(residuals^2))
print(paste("Root Mean Squared Error: ", rmse))
## [1] "Root Mean Squared Error: 0.418305186764801"
# RMSE Percentage from Mean
mean <- mean(train$imdb_rating)
pct_rmse <- (rmse / mean) * 100
# R-Squared
ssr <- sum((test$imdb_rating - predictions)^2)
sst <- sum((test$imdb_rating - mean(test$imdb_rating))^2)
rsq <- 1 - (ssr / sst)
print(paste("R-Squared: ", rsq))
## [1] "R-Squared: 0.812325473210955"
# Adjusted R-Squared
n <- length(test$imdb_rating) # Number of observations
p <- length(model4$model) - 1 # Number of predictors (excluding the constant term)
adj_rsq <- 1 - (1 - rsq) * (n - 1) / (n - p - 1)
print(paste("Adjusted R-Squared: ", adj_rsq))
## [1] "Adjusted R-Squared: 0.805999365566381"
Let’s see how each model compares to the full.
# Change results columns to numeric and round
results %>% mutate_at(vars(2:5), as.numeric) %>%
mutate_if(is.numeric, round, digits = 4)
## # A tibble: 4 × 5
## model rmse pct_rmse rsq adj_rsq
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Full Model 0.413 6.27 0.817 0.802
## 2 Model 2 0.417 6.33 0.814 0.801
## 3 Model 3 0.417 6.34 0.814 0.804
## 4 Model 4 0.418 6.36 0.812 0.806
While the full model provided the lowest RMSE, indicating the smallest average prediction error, it also had the lowest adjusted R-squared value. This suggests that, despite its lower prediction error, the full model may be overfitting to the training data and may not generalize as well to new data. On the other hand, Model 4, despite having a slightly higher RMSE, had the highest adjusted R-squared value, suggesting that it may provide a better balance between fit and generalizability.
Predictions
Our final test of predictions is based upon 3 films released in 2016 that were not included in the original data.
Title | URLs |
---|---|
Doctor Strange | Rotten
Tomatoes IMDb |
Zoolander 2 | Rotten
Tomatoes IMDb |
Deadpool | Rotten
Tomatoes IMDb |
# New data frame for outside data
new_test <- tibble(title = c("Doctor Strange","Zoolander 2", "Deadpool"),
title_type = c("Feature Film", "Feature Film", "Feature Film"),
genre = c("Action & Adventure", "Comedy", "Action & Adventure"),
runtime = c(115, 102, 108),
mpaa_rating = c("PG-13", "PG-13", "R"),
studio = c("Disney", "Paramount", "Fox"),
thtr_rel_year = c(2016, 2016, 2016),
thtr_rel_month = c(NA, NA, NA),
thtr_rel_day = c(NA, NA, NA),
dvd_rel_year = c(NA, NA, NA),
dvd_rel_month = c(NA, NA, NA),
dvd_rel_day = c(NA, NA, NA),
imdb_rating = c(7.5, 4.7, 8),
imdb_num_votes = log(c(769000, 73000, 1100000)),
critics_rating = factor(c("Certified Fresh", "Rotten", "Certified Fresh"),
levels = c("Rotten", "Fresh", "Certified Fresh")),
critics_score = c(NA, NA, NA),
audience_rating = c(NA, NA, NA),
audience_score = c(85, 20, 90),
best_pic_nom = c(0, 0, 0),
best_pic_win = c(0, 0, 0),
best_actor_win = c(0, 0, 1),
best_actress_win = c(1, 0, 1),
best_dir_win = c(0, 0, 0),
top200_box = c(0, 1, 0),
director = c(NA, NA, NA),
actor1 = c(NA, NA, NA),
actor2 = c(NA, NA, NA),
actor3 = c(NA, NA, NA),
actor4 = c(NA, NA, NA),
actor5 = c(NA, NA, NA),
imdb_url = c(NA, NA, NA),
rt_url = c(NA, NA, NA))
# Predictions for new data with all 4 models
new_test$full_model_y <- predict(full_model, newdata = new_test)
new_test$full_model_res <- new_test$imdb_rating - new_test$full_model_y
new_test$model2_y <- predict(model2, newdata = new_test)
new_test$model2_res <- new_test$imdb_rating - new_test$model2_y
new_test$model3_y <- predict(model3, newdata = new_test)
new_test$model3_res <- new_test$imdb_rating - new_test$model3_y
new_test$model4_y <- predict(model4, newdata = new_test)
new_test$model4_res <- new_test$imdb_rating - new_test$model4_y
How did each model perform on new data?
# All model predictions
new_test %>% select(title, imdb_rating,
full_model_y,
model2_y,
model3_y,
model4_y) %>%
mutate_if(is.numeric, round, digits = 3)
## # A tibble: 3 × 6
## title imdb_rating full_model_y model2_y model3_y model4_y
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Doctor Strange 7.5 7.55 7.61 7.60 7.58
## 2 Zoolander 2 4.7 4.78 4.82 4.88 4.89
## 3 Deadpool 8 7.76 7.79 7.78 7.74
# All models residuals
new_test %>% select(title, imdb_rating,
full_model_res,
model2_res,
model3_res,
model4_res) %>%
mutate_if(is.numeric, round, digits = 3)
## # A tibble: 3 × 6
## title imdb_rating full_model_res model2_res model3_res model4_res
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Doctor Strange 7.5 -0.054 -0.107 -0.095 -0.084
## 2 Zoolander 2 4.7 -0.079 -0.119 -0.179 -0.191
## 3 Deadpool 8 0.236 0.208 0.222 0.264
# Average residuals per model on new data
tibble(full_model = abs(new_test$full_model_res),
model2 = abs(new_test$model2_res),
model3 = abs(new_test$model3_res),
model4 = abs(new_test$model4_res)) %>%
colSums() / 3
## full_model model2 model3 model4
## 0.1232441 0.1449902 0.1652650 0.1795911
The Full Model appears to outperform on our test data and our outside data despite its lower \(R^{2}_{\text{adj}}\). The RMSE for all 4 models in the outside data is significantly lower than the 0.41 error in the test data. Therefore, Model 4 will be chosen as the final parsimonious model.
Diagnostics
We put the model data in its own data frame to make plotting the diagnostics easier.
# Put residuals & fitted values into data frame for ggplot
model_data <- data.frame(residuals = model4$residuals,
fitted_values = model4$fitted.values,
index = 1:length(model4$residuals))
Linear Relelationships Between Y & X (numeric)
# plot(full_model$residuals ~ train$runtime)
ggplot(train, aes(x = runtime, y = model_data$residuals)) +
geom_point() +
geom_hline(yintercept = 0, color="red", linetype="dashed") +
labs(y="Residuals") +
theme_bw()
# plot(full_model$residuals ~ train$imdb_num_votes)
ggplot(train, aes(x = imdb_num_votes, y = model_data$residuals)) +
geom_point() +
geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
labs(y = "Residuals") +
theme_bw()
# plot(full_model$residuals ~ train$audience_score)
ggplot(train, aes(x = audience_score, y = model_data$residuals)) +
geom_point() +
geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
labs(y = "Residuals") +
theme_bw()
All 3 numeric variables are randomly dispersed when plotted against the model residuals. This implies linearity in the numeric predictors and the response variable, and homoscedasticity which implies a constant variance accross all 3 numeric variables.
The residuals of the model, when plotted against each of the three numeric predictors, are randomly dispersed. This pattern suggests
Linearity - each predictor has a linear relationship with the response variable.
Homoscedasticity - the variance of the errors is constant across all levels of these predictors.
Nearly Normal Residuals
# hist(full_model$residuals)
ggplot(model_data, aes(x = residuals)) +
geom_histogram(binwidth = 0.1, fill = "red", color = "black", alpha = 0.7) +
labs(title = paste("Histogram of Residuals")) +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
A histogram of residuals reveals nearly normalization with a slight left skew and a mean = 0.
ggplot(model_data, aes(sample = residuals)) +
geom_qq_line(color = "red", linetype = "dashed") +
geom_qq(size = 2, color = "black", alpha = 0.5) +
labs(title = "Normal Q-Q Plot of Residuals",
x = "Theoretical Quantiles",
y = "Sample Quantiles") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
The Normal Q-Q Plot suggests normality with a majority of points along the line with a slight skew still visible on the left.
Constant Variability of Reisudals
# plot(full_model$residuals ~ full_model$fitted.values)
ggplot(model_data, aes(x = fitted_values, y = residuals)) +
geom_point(color = "black", fill = "darkgray", size = 2) +
geom_hline(yintercept = 0, color="red", linetype = "dashed") +
labs(x = "Fitted Values", y = "Residuals") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
# plot(abs(full_model$residuals) ~ full_model$fitted.values)
ggplot(model_data, aes(x = fitted_values, y = abs(residuals))) +
geom_point(color = "black", fill = "darkgray", size = 2) +
geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
labs(x = "Fitted Values", y = "Residuals") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
A check for constant variability of residuals along the predicted values suggests that the model does not perform any better or worse in certain ranges of the predicted values.
Independence of Residuals
# plot(full_model$residuals)
ggplot(model_data, aes(x = index, y = residuals)) +
geom_point() +
geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
labs(x = "Index", y = "Residuals") +
theme_bw()
The random scatter around 0 implies that the residuals in the model are independent of themselves.
Conclusion
The primary objective of this analysis was to develop a Multiple Linear Regression model capable of predicting IMDb movie ratings using various independent variables related to the movie’s characteristics and critics.
The full model, which included all 14 independent variables, achieved a Root Mean Squared Error (RMSE) of 0.4109 and an Adjusted R-squared value of 0.802. This model performed relatively well in terms of prediction accuracy, as evidenced by its low RMSE.
Through a series of iterations, we eliminated variables with high
p-values, multicollinearity issues, and minimal contributions to the
model’s performance. This process led to the development of Model 4,
which included six independent variables: title_type
,
genre
, runtime
, critics_rating
,
audience_score
, and imdb_num_votes.
Despite
having a slightly higher RMSE of 0.418, Model 4 achieved the highest
Adjusted R-squared value of 0.806, indicating a better balance between
fit and generalizability.
When tested on a small set of new movies released in 2016, the full model outperformed the other models in terms of prediction accuracy, suggesting that it may be better suited for predicting ratings for new movies not included in the original dataset.
Overall, the analysis successfully developed multiple regression models capable of predicting IMDb movie ratings with reasonable accuracy. The choice between the full model and Model 4 would depend on the specific requirements and constraints of the use case. If the primary objective is to achieve the highest possible prediction accuracy, the full model may be preferred, despite the risk of overfitting. On the other hand, if generalization and interpretability are more important considerations, Model 4 could be the better choice due to its higher Adjusted R-squared value and simpler structure.
Future attempts at model construction could involve exploring alternative modeling techniques, such as ensemble methods or machine learning algorithms, to potentially improve prediction accuracy further. Additionally, incorporating more diverse and up-to-date data sources, as well as considering additional movie-related features, could enhance the models’ predictive capabilities.