Why Can't I Predict the Oscars?

by Bayne Brannen

Summary

I use an IMDB dataset and scrape Rotten Tomatoes to attempt to predict Oscar-winning films using a Random Forest regression model.

  • Python
  • Data collection with the OMDB API, BeautifulSoup, and the requests library
  • Data cleaning with pandas
  • Machine learning with scikit-learn
  • Statistical analysis

Alright, I admit it: it's a naive project. So why embark on this little journey? Well, first of all, I'm fascinated by films, especially as they pertain to the cultural zeitgeist and popular culture. While winning one is still an extreme honor, the Oscars have definitely had their hits and their misses. Look at the winner in 2020: the critically acclaimed Parasite, which captivated both critics and general audiences and holds the honor of being the first non-English-language film to win Best Picture in the awards' 92-year history. Most seemed to agree this was the right call. Then in 2019, the victor was Green Book, which had a decent reception from critics (although not that decent), but whose win received a lot of backlash for being idealistic, unchallenging… well, Oscar bait. So, while culturally there appears to be disagreement over what deserves an Oscar, I am curious to see whether the data on hand is less contentious. And regardless of whether I am able to predict which films win, I am quite interested in which factors are most predictive.

First, the dataset. I am going to use this dataset from Kaggle, uploaded by user Stefano Leone. It contains over 85,000 movies going back to the early 1900s, but I decided to narrow it down to just the past 20 years, leaving me with approximately 48,000 films. I made this decision for a couple of reasons. For one thing, I have a hunch that Oscar trends are fleeting and would vary significantly across many decades. Secondly, Rotten Tomatoes (which I believe will be a major predictive factor) was founded in 1998, so any films that came out before then will have mixed data as to when their reviews came in, and their scores will likely be skewed.
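The narrowing step itself is a one-liner. Here's a minimal sketch of the filter, assuming the dataset exposes a numeric year column (the column name and the toy rows below are illustrative, not the real Kaggle data):

```python
import pandas as pd

# Toy stand-in for the Kaggle dataset; the real one has ~85,000 rows.
movies = pd.DataFrame({
    "title": ["Metropolis", "Parasite", "Green Book"],
    "year": [1927, 2019, 2018],
})

# Keep only films released in the last 20 years of the dataset.
cutoff = movies["year"].max() - 20
recent = movies[movies["year"] >= cutoff].reset_index(drop=True)
```

A boolean mask like this keeps the operation vectorized, which matters at 85,000 rows.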

“But Bayne, this dataset doesn’t have any data on Rotten Tomatoes scores!” Not yet, my friend. Patience. Before we get there, this dataset also doesn’t have any data on whether these films even won Oscars. This raises the question: what do I mean by a film having won an Oscar? I’m going to cast a wide net first, using the OMDB API (essentially an unofficial IMDB API) to collect my data. Since I have the IMDB ID for each film, I will simply query the database with the ID and grab the "awards" field of the response. I will then clean it up using regular expressions to end up with the number of Oscars each film won. I will also go in and manually create a binary column indicating whether a film won the Oscar for Best Picture specifically. I’m not sure I’ll end up using the number of Oscars each film won, as there are so many different kinds of Oscars a film could win, but I might try it. It was easy enough to collect anyway, so why not?

import omdb

# API key redacted; substitute your own OMDB key.
client = omdb.OMDBClient(apikey='YOUR_API_KEY')

def get_award(imdb_id):
    try:
        return client.imdbid(imdb_id)["awards"]
    except Exception:
        return "Not found"

movies["awards"] = movies.iloc[:, 0].apply(get_award)

# Pull the count out of strings like "Won 4 Oscars. Another 157 wins..."
movies["oscars_won"] = movies[movies["awards"].str.contains(r'Won \d+ Oscar', regex=True)]["awards"].str.extract(r'(\d+)')
movies["oscars_won"] = movies["oscars_won"].fillna(0).astype(int)

Okay, now we can get to the meat and tomatoes of this thing. The Rotten Tomatoes scores are going to be a bit more difficult to pull. Unlike with the OMDB API, there is no handy ID I can use to find a film in the Rotten Tomatoes database, nor is there a Rotten Tomatoes API available to hobbyists. Looks like I'll have to build an old-fashioned web crawler using the requests and BeautifulSoup libraries. You might think I'd be fine using the title of each movie to search for its Rotten Tomatoes score, but then we might end up with a rather high score for the film Parasite from 2004. So instead, I am going to combine the name of the film and the director's name into a search query. It may not be absolutely, 100% fool-proof, but it's almost certain to retrieve the most popular films, which are the ones we need.

import urllib.parse

movies['tomatoes_search'] = movies['title'] + " " + movies['director']
movies['tomatoes_search'] = movies['tomatoes_search'].apply(lambda x: urllib.parse.quote_plus(str(x)))
# quote_plus encodes spaces as '+'; the Rotten Tomatoes search URL wants '%20'.
movies['tomatoes_search'] = movies['tomatoes_search'].str.replace(r"\+", "%20", regex=True)

There is also an issue of independence of variables when it comes to Rotten Tomatoes. If a film wins an Oscar or is nominated for one, it is likely to see an influx of reviews from critics trying to capitalize on that nomination or victory; this may even work in the inverse, where a film that does not receive a nomination makes critics feel more comfortable giving it poor reviews. To resolve this, I've manually collected the dates of the Oscar nomination announcements from the past twenty years, labeled by year. Using this list, I will make sure that any Rotten Tomatoes review I collect was written before the nomination date of that year's Oscars.

nominee_announcements = pd.read_csv("oscars_nominee_announcements.csv")
nominee_announcements['Date'] = pd.to_datetime(nominee_announcements['Date'])
nominee_announcements["Year"] = nominee_announcements["Year"] - 1
nominee_announcements = nominee_announcements.rename(columns={'Year' : 'year', 'Date' : 'nom_date'})

movies = movies.merge(nominee_announcements, how='left', left_on = 'year', right_on = 'year')

My web crawler will cycle through all the reviews written for each film and build a list of the words “fresh” and “rotten” for each good and bad review respectively. This lets me reconstruct what the Rotten Tomatoes score of each film would have been before it was or wasn’t nominated for an Oscar, and also lets me count the total number of reviews. My handler function, fresh_list, will attempt to access the correct URL (with some backup error handling in case we’ve hit Rotten Tomatoes too many times too quickly), create a BeautifulSoup object out of the HTML, and then loop through each page of reviews. Along the way, it will use the get_review_list function, which returns the actual list of ‘fresh’ and ‘rotten’ reviews, but only for reviews dated earlier than the nomination date of the Oscars for the year the film was released. That function in turn relies on two helpers which I believe I’ve named quite aptly: get_review_date and get_review_rating. The get_review_rating function returns the rating of each individual review as get_review_list cycles through them, and get_review_date returns the date of the review so that it can be compared to the nomination date. Note that while I have omitted the in-line comments of my code for most of this post, I will supply them here, as there is a lot going on. This process takes quite a while: approximately 30 hours.

def fresh_list(search_uri, nom_date):

    # Here I use an error handling exception to make sure I don't timeout because Rotten Tomatoes wants me to slow down with my
    # scraping. I have found that this keeps the crawler moving with just a few exceptions cropping up throughout the crawl
    # First I simply try accessing the url.

    try:
        search_url = requests.get('https://www.rottentomatoes.com/search?search=' + search_uri)

    # If there's some kind of error, the program waits 15 seconds and tries again. If it still fails, it waits 20
    # more seconds and tries one last time. At that point, if it fails again, I should just stop because Rotten Tomatoes
    # clearly wants me to; but with this handling, I didn't run into that problem.

    except Exception:
        try:
            print("Too many requests.")
            time.sleep(15)
            search_url = requests.get('https://www.rottentomatoes.com/search?search=' + search_uri)
        except Exception:
            print("Try again. Too many requests.")
            time.sleep(20)
            search_url = requests.get('https://www.rottentomatoes.com/search?search=' + search_uri)


    # I'll implement another error handling exception just so that—if for any reason the html is not suitable for handling
    # (e.g. it is essentially empty of the necessary elements or has no reviews) then it will be skipped and return NaN

    try:

        # I use the BeautifulSoup library in order to easily parse the html from the webpage

        soup = BeautifulSoup(search_url.content, 'html.parser')

        # I specify that I want the list of search results from the page

        html = json.loads(soup.find(id='movies-json').text)

        # I retrieve the url of the first search result, which I can hope is the film I was searching for and create a url
        # to take us to the reviews page based on the Rotten Tomatoes format

        reviewpage = requests.get(html['items'][0]['url'] + "/reviews?type=&sort=&page=1")

        # I then create a BeautifulSoup object for the reviews page itself

        soup = BeautifulSoup(reviewpage.content, 'html.parser')

        # I grab the text "Page 1 of x" from this page so that I will be able to extract how many pages I will need to cycle
        # through

        page_number = soup.find(class_='pageInfo')

        # I create an empty list variable for the storing of the rotten/fresh reviews

        fresh_rotten_list = []

        # This if statement just checks to make sure there is more than one page based on the page_number 

        if page_number is not None:
            # Uses regex to extract the second number aka the total number of pages. Also converts it to an integer.
            page_number = page_number.text
            page_pat = '(?<=of )[0-9]{1,2}'
            final_page = re.findall(page_pat, page_number)[0]
            final_page = int(final_page)

            # This for loop just iterates through the pages using the 'final_page' variable and grabs the rotten/fresh scores
            # from each page using the get_review_list function, appending each resulting list to the overall fresh_rotten_list

            for page in range(1, final_page+1):
                reviewpage = requests.get(html['items'][0]['url'] + "/reviews?type=&sort=&page=" + str(page))
                soup = BeautifulSoup(reviewpage.content, 'html.parser')
                fresh_rotten_list.append(get_review_list(soup, nom_date))

        # If there is no page indicator, this means either there are no reviews or only one page of reviews. In this case,
        # I use an error-handling exception that tries to use the get_review_list function to return the list of review
        # results from the one page, but if the review table is inaccessible (i.e. there are no reviews) then it simply returns
        # a NaN value rather than a review list.

        else:
            try:
                fresh_rotten_list.append(get_review_list(soup, nom_date))
            except Exception:
                fresh_rotten_list.append(None)
    except Exception:
        return None

    # If all went well, the function will return a list with a certain number of 'fresh' and 'rotten' strings to be counted
    # later via vectorized functions on the dataframe

    return fresh_rotten_list

# This function will retrieve the reviews from rotten tomatoes as a list consisting of the words fresh and rotten. 
# The number of instances of each word will indicate how many fresh and rotten reviews there are respectively.
# This function takes the parameters 'reviews_soup', the BeautifulSoup object of the reviews page, which it will 
# receive from the fresh_list function that actually handles the initial scraping of the page. It also takes the parameter
# nom_date so it can pull only reviews that happened before the nominations.

def get_review_list(reviews_soup, nom_date):

    # Creates the list where the words 'rotten' and 'fresh' will be stored

    rating_list = []

    # Uses BeautifulSoup to find all of the elements on the page that constitute the review summaries.

    reviews = reviews_soup.find_all(class_='row review_table_row')

    # It cycles through all of these review summaries and retrieves the rating of each review, but only if the review was
    # published before the film was nominated for an Oscar.

    for review_row in reviews:

        # I use the get_review_date function here to get the date of the review.

        review_date = get_review_date(reviews_soup, review_row)

        # Here, the function determines whether the review date was before the nomination date. If so, it pulls the
        # rating "fresh" or "rotten" from that review; otherwise, it skips the row.

        if review_date < nom_date:
            rating_list.append(get_review_rating(reviews_soup, review_row))

    return rating_list

# Returns the date from the review by using regex to extract the date from the BeautifulSoup object and converting it to a 
# datetime object

def get_review_date(page_soup, review_row):
    date = review_row.find(class_="review-date subtle small").text
    date_pat = r'[A-Z][a-z]{1,9}[ ][0-9]{1,2}[,][ ][0-9]{4}'
    date = re.findall(date_pat, date)[0]
    date = datetime.datetime.strptime(date, '%B %d, %Y')
    return date

# Returns the rating of a review as a string by finding the icon title for the review and just returning the part of the name
# of the icon that matters—'fresh' or 'rotten'.

def get_review_rating(page_soup, review_row):

    rating = review_row.find(class_=lambda class_: class_ and class_.startswith("review_icon icon small"))
    rating = rating["class"][3]
    return rating
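With the helpers defined, applying them across the dataframe pairs each film's search string with its nomination date via a row-wise apply. Here's a sketch of that pattern, with a hypothetical fetch function standing in for fresh_list so the example runs without network access:

```python
import pandas as pd

# Hypothetical stand-in for fresh_list, so the pattern can be shown offline;
# the real call would be fresh_list(row['tomatoes_search'], row['nom_date']).
def fetch(search_uri, nom_date):
    return ["fresh", "fresh", "rotten"]

movies = pd.DataFrame({
    "tomatoes_search": ["Parasite%20Bong%20Joon-ho"],
    "nom_date": pd.to_datetime(["2020-01-13"]),
})

# axis=1 hands each row to the function, keeping the two arguments paired.
movies["rt_list"] = movies.apply(
    lambda row: fetch(row["tomatoes_search"], row["nom_date"]), axis=1
)
```

Row-wise apply is slow, but here the network round-trips dominate anyway, which is why the real crawl took so long.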

Okay, so now I have several features I think might prove useful when predicting the Oscars, but they still need to be cleaned up a little in anticipation of the analysis. First of all, my Rotten Tomatoes data are still just lists of the words ‘fresh’ and ‘rotten’, so it’s time to quantify those by creating a column for the number of Rotten Tomatoes reviews and then, of course, the ever-important Rotten Tomatoes score.

movies_twenty_years['rt_num'] = movies_twenty_years['rt_list'].apply(lambda x: Counter(x)['fresh'] + Counter(x)['rotten'])
movies_twenty_years['rt_score'] = movies_twenty_years['rt_list'].apply(
    lambda x: Counter(x)['fresh'] / (Counter(x)['fresh'] + Counter(x)['rotten'])
    if (Counter(x)['fresh'] + Counter(x)['rotten']) > 0 else None)

Now that I have the Rotten Tomatoes scores prepped for the model, I'd like to prepare the genre, which I think will be another interesting feature to try. Currently, each film's genres are held as a single string, separated by commas. This is not ideal for the model, as I'd like to see the importance of each genre as an individual feature. So I will first use a vectorized string function to split the string on the commas into its own dataframe with the IMDB ID as the index. I then use a lambda function to strip the whitespace off each genre. Finally, I use the pandas function get_dummies to turn the genres into columns with a binary indicator for whether or not the genre was originally in that row. Then I merge the original dataset with this new dataframe using the IMDB ID. Now the original dataset has columns indicating whether or not each film falls into each genre.

cleaned_genre = movies.set_index(['imdb_title_id'])["genre"].str.split(',',expand=True).stack()
cleaned_genre = cleaned_genre.apply(lambda x : x.strip())
genre_encoded = pd.get_dummies(cleaned_genre, prefix='g').groupby(level=0).sum().reset_index()

movies = movies.merge(genre_encoded, left_on=["imdb_title_id"], right_on=["imdb_title_id"])

Another feature I'd like to look at is whether a film being American is predictive of an Oscar win. Since Parasite is the only non-English-language film ever to have won Best Picture, I imagine this may be somewhat predictive. I will quickly fill in any missing values in the country column, use the vectorized string function contains to see whether the USA is among a film's countries, and use that to create a new column called American with a True or False indicator.

movies["country"] = movies["country"].fillna('None')
movies["American"] = movies["country"].str.contains("USA")

Okay, now it's time to see if a model will be at all predictive of Oscar-winning films. At this point I am going to try to predict only whether a film won Best Picture, as I would imagine the reasons a film might win different awards could be quite different. For instance, a film that wins Best Makeup and Hairstyling may not even be a very good film. So I am going to use the following features to try to predict whether a film does or does not win Best Picture:

features = ['rt_score', 'American', 'g_Action', 'g_Adult',
       'g_Adventure', 'g_Animation', 'g_Biography', 'g_Comedy', 'g_Crime',
       'g_Documentary', 'g_Drama', 'g_Family', 'g_Fantasy', 'g_Film-Noir',
       'g_History', 'g_Horror', 'g_Music', 'g_Musical', 'g_Mystery', 'g_News',
       'g_Reality-TV', 'g_Romance', 'g_Sci-Fi', 'g_Sport', 'g_Thriller']

These features essentially cover three aspects of the film: the Rotten Tomatoes score (before the film was nominated), whether or not the film was American, and whatever genres the film may fall under. Now I will set up the train-test split, train the model, and then test it out.

X_train, X_test, y_train, y_test = train_test_split(movies_with_rt[features], movies_with_rt["best_picture"], train_size=0.7,test_size=0.3, random_state=1)

regressor = RandomForestRegressor(n_estimators=60, random_state=0)
regressor.fit(X_train, y_train)

y_pred = regressor.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

I have trained and tested the model and calculated the root mean squared error (RMSE). This should essentially show me, on average, how far my predictions were from the truth. The figure turned out to be approximately 0.077. Is this a good result? The short answer is no. The value we are trying to predict is either 0 or 1, so of course most predictions will come out somewhere in between, which we can read as the likelihood that the film is a winner. A film predicted to have a value of 0.75 has (according to the model) a 75% chance of winning Best Picture; the closer the prediction is to 1, the more likely the film is to be a winner. On the surface, an RMSE of 0.077 doesn't seem so bad, since it would mean my predictions are off by roughly 8 percentage points of likelihood on average. The problem is class imbalance: the dataset contains vastly more losers than winners. I tried to ameliorate this by taking only the top 100 Rotten Tomatoes-scored films from each year, but even that couldn't save the model, because there's such a small sample of winners. A model that predicts every single film is a loser would still have about 99% accuracy. It almost doesn't matter what the RMSE is, because the model doesn't privilege predicting a winner over a loser.
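The imbalance point is easy to make concrete. A degenerate "model" that calls every film a loser still looks great on both accuracy and RMSE (the 1-in-100 winner rate below is illustrative, not my exact class balance):

```python
import numpy as np

# Illustrative labels: 1 winner among 100 films.
y_true = np.zeros(100)
y_true[0] = 1

# A degenerate model that predicts 'loser' for everything.
y_pred = np.zeros(100)

accuracy = np.mean(y_true == y_pred)              # 0.99
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))   # ~0.1
```

So a low RMSE alone can't tell us the model has learned anything about winners; we have to look at the actual predictions for the positive class.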

I can demonstrate this by comparing the model's predictions with the true values. Below, I sorted the resulting test split by whether or not each film won Best Picture and created a column for the predictions. Among the top 10, none of the winners got a particularly good score, with Birdman being given essentially a 0% chance of winning Best Picture. The King's Speech fared better, but still with less than a 50% chance.
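The comparison tables in this section were built along these lines; the titles and prediction values below are hypothetical stand-ins for the output of regressor.predict(X_test), just to show the sorting pattern:

```python
import pandas as pd

# Hypothetical results frame; in the real analysis, 'preds' comes from
# regressor.predict(X_test) aligned back to the test-split titles.
results = pd.DataFrame({
    "title": ["Birdman", "The King's Speech", "Changing Lanes"],
    "best_picture": [1, 1, 0],
    "preds": [0.0, 0.366667, 0.0],
})

# Winners first (then by predicted likelihood), to eyeball the true positives.
by_truth = results.sort_values(["best_picture", "preds"], ascending=False)

# And the films the model considered most likely to win, regardless of truth.
by_preds = results.sort_values("preds", ascending=False)
```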

title best_picture preds
Birdman or (The Unexpected Virtue of Ignorance) 1 0.000000
The King's Speech 1 0.366667
American Beauty 1 0.016667
Changing Lanes 0 0.000000
A Series of Unfortunate Events 0 0.000000
The Switch 0 0.000000
Tears of the Sun 0 0.000000
The Spectacular Now 0 0.000000
The Skeleton Twins 0 0.000000
A.I. Artificial Intelligence 0 0.000000

I can also take a look at the films that were the most likely to win an Oscar according to the model.

title best_picture preds
The Ballad of Buster Scruggs 0 0.583333
The Town 0 0.416667
The King's Speech 1 0.366667
Hairspray 0 0.350000
All the Money in the World 0 0.333333
Knives Out 0 0.316667
Ve stínu 0 0.283333
The Pianist 0 0.256667
Between Pain and Amen 0 0.233333
De belofte van Pisa 0 0.216667

Here we can see that the film the model considered most likely to win Best Picture in the entire dataset was The Ballad of Buster Scruggs, a film that was not even nominated. Furthermore, even as the top prediction, it is still only around 58% likely to win. Given an RMSE of roughly 8 percentage points, it might as well be a coin flip whether it would win Best Picture. I see a modicum of success in that The King's Speech was the third most likely film to win Best Picture, but even then the prediction would imply it is more likely to lose than win.

So the failure of the model was somewhat anticipated and predictable, but I'd still like to take a look at the feature importances from the Random Forest regressor to see what features were most predictive:

importances = pd.DataFrame(zip(features, regressor.feature_importances_), columns=["feature", "importance"])
importances.sort_values(by="importance", ascending=False)
feature importance
rt_score 0.695431
g_Thriller 0.066920
g_Crime 0.041410
g_Biography 0.024588
American 0.023056
g_History 0.021506
g_Action 0.020424
g_Romance 0.018170
g_Comedy 0.015597
g_Fantasy 0.012293
g_Adventure 0.011753
g_Drama 0.011721
g_Musical 0.010130
g_Sport 0.009981
g_Sci-Fi 0.007582
g_Mystery 0.006537
g_Animation 0.001494
g_Horror 0.000862
g_Music 0.000546
g_Family 0.000000
g_Documentary 0.000000
g_Adult 0.000000
g_News 0.000000
g_Reality-TV 0.000000
g_Film-Noir 0.000000

These importances actually make quite a lot of sense to me. I expected the Rotten Tomatoes score to be quite predictive, and indeed it was, but throughout this project I'd also been thinking about the pattern of themes one sees in Oscar-winning films. We can see common thematic strands running through these films: disability (Million Dollar Baby, The King’s Speech, The Shape of Water, A Beautiful Mind), the film industry (Birdman, The Artist, Argo), rags-to-riches stories (Gladiator, Slumdog Millionaire, Parasite), and crime films (Crash, The Departed, No Country for Old Men, arguably Chicago), to name a few. I think it's no coincidence that the one theme among these that is codified into a genre (crime) has one of the highest importances as a feature. That said, the crime-film winner seems to have been particularly popular in the early 2000s; a film that falls squarely in the crime genre has not won since 2007. This brings up another point about predicting the Oscars: the winners may follow trends that are hip at the time but don't last into the next decade.

Looking at the feature importances also made me think of a theme I hadn't considered before: biographies and true stories. Million Dollar Baby, The King’s Speech, Green Book, Spotlight, 12 Years a Slave, Argo, and A Beautiful Mind are all either biographical or "based on a true story." I had never noticed this before, but the high importance of the "biography" and "history" genres illuminated the pattern for me. So even if I wasn't able to predict the Oscars, the model did elucidate some interesting points about what kind of film is more likely to win. Next time you're throwing your bets into the Oscar pool for Best Picture, you might want to pick a true story. Nomadland certainly did not buck that trend in 2021!