It’s that time of the year when movie fans around the world are glued to their TVs, soaking in everything the Oscars have come to represent over the years: the nominees and the snubs, celebrities, designer outfits, rumors of impending breakups, newcomers making waves, and oh yes, also some of the best movies of the previous year. Following in the footsteps of our success last year, we’re getting ready to once again predict which performances and productions deserve to win this year’s gold-plated statues, which signify perhaps the highest achievement in the 131-year-old business of fast-moving pictures.
This year, in an attempt to involve all our readers in this fun exercise (and a nice intro use case for Machine Learning), we’re publishing the corresponding dataset in the BigML gallery. Rest assured we’ve already done most of the hard work to gather the data and verify its completeness. It sports 20 categorical, 56 numeric, 42 items, and 1 datetime field, for a total of 119 fields that give you plenty of detail about the past nominees and winners. The dataset is organized such that each record represents a unique movie identified by the field movie_id. The first 17 fields cover the metadata associated with each movie, e.g., release_date, genre, synopsis, duration, metascore. The following fields record the outcomes of past Academy Awards and 19 other relevant awards such as the Golden Globes, Screen Actors Guild, BAFTA and more. Finally, we have some automatically generated datetime fields based on the release_date of each movie entry. Please note that this rather abbreviated dataset has one limitation: predictions are made per movie title only, which means that when multiple people are nominated from a single movie, you’ll have to make a judgment call between those nominees.
To make your own predictions, you’ll need to perform a time split and create a training dataset spanning the period 2000-2017 as well as a test dataset for the movies released in 2018, which are essentially the nominees for the 2019 Oscars. To save time, the dataset is prepared to handle multiple awards. So instead of dealing with a different dataset for each award, you can simply drop the unneeded target fields and select as your target field the award you’re trying to predict. For instance, if you’re looking to predict Best Picture, then you select Oscar_Best_Picture_Won as the target and exclude the rest of the fields sharing the naming convention Oscar_XXXXX_Won.
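If you’d rather prepare the data outside the BigML Dashboard, the time split and target selection described above can be sketched in a few lines of plain Python. Treat this as a rough illustration under some assumptions, not the way the dataset must be handled: we assume the rows were loaded as dictionaries (e.g., with csv.DictReader) and that release_date is an ISO-formatted string, which may not match the actual export. The field Oscar_Best_Director_Won, the "yes"/"no" values, and the sample rows are made up for the example.

```python
from datetime import date

TARGET = "Oscar_Best_Picture_Won"  # target field named in the post


def is_other_oscar_target(field, target=TARGET):
    """True for Oscar_XXXXX_Won fields other than the chosen target."""
    return (
        field.startswith("Oscar_")
        and field.endswith("_Won")
        and field != target
    )


def prepare(rows, target=TARGET, train_years=(2000, 2017), test_year=2018):
    """Split rows by release year and drop the unneeded Oscar target fields.

    Assumption: each row is a dict whose release_date is a YYYY-MM-DD string.
    Rows outside both periods are left out of the result.
    """
    train, test = [], []
    for row in rows:
        year = date.fromisoformat(row["release_date"]).year
        kept = {
            field: value
            for field, value in row.items()
            if not is_other_oscar_target(field, target)
        }
        if train_years[0] <= year <= train_years[1]:
            train.append(kept)
        elif year == test_year:
            test.append(kept)
    return train, test


# Made-up rows, only to show the shape of the input and output.
sample = [
    {"movie_id": "m1", "release_date": "2016-11-18",
     "Oscar_Best_Picture_Won": "yes", "Oscar_Best_Director_Won": "no"},
    {"movie_id": "m2", "release_date": "2018-10-05",
     "Oscar_Best_Picture_Won": "", "Oscar_Best_Director_Won": ""},
    {"movie_id": "m3", "release_date": "1999-03-31",
     "Oscar_Best_Picture_Won": "no", "Oscar_Best_Director_Won": "no"},
]

train, test = prepare(sample)
print(len(train), "training rows,", len(test), "test rows")
```

To predict a different award, pass another target name to prepare; every other Oscar_XXXXX_Won field is then dropped while the non-Oscar award fields stay in as predictors.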
Here are some additional clues for newbies:
- Get familiar with the dataset by building some scatterplot visualizations
- Start with simpler methods like Models or Logistic Regressions to see which fields seem to correlate well with the outcome you’re looking to predict (i.e., use the Model Summary Report)
- Add more sophisticated techniques like Deepnets or Ensembles later on
- Run some side-by-side Evaluation Comparisons to compare your best performing classification models
- Try an OptiML and see how automatic Machine Learning performs vs. your previous attempts
- For additional peace of mind, validate models with last year’s predictions as a tie-breaker exercise
- See if you can build some Fusions from your top classifiers to improve the robustness of your predictions further
- Compare your predictions to those of human experts, and better yet, see how they deviate by using the handy prediction explanations feature of BigML
- BONUS: Go beyond what we supply here and add your own features and Data Transformations to the original movie dataset for an additional edge.
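As one idea for the bonus item, here is a small sketch of a hand-made feature: the number of non-Oscar awards a movie has already won. It is only an illustration under assumptions we can’t promise hold for the real dataset: that the other awards also use field names ending in _Won, and that a win is encoded as "yes", "true" or "1". The field names in the example row and the new field other_award_wins are hypothetical.

```python
def add_other_award_wins(row, truthy=("yes", "true", "1")):
    """Return a copy of row with a count of non-Oscar awards won.

    Assumption: non-Oscar award fields end in "_Won" and do not start
    with "Oscar_"; the values counted as a win are listed in `truthy`.
    """
    wins = sum(
        1
        for field, value in row.items()
        if field.endswith("_Won")
        and not field.startswith("Oscar_")
        and str(value).strip().lower() in truthy
    )
    return {**row, "other_award_wins": wins}


# Hypothetical row; the award field names are invented for the example.
example = {
    "movie_id": "m1",
    "Oscar_Best_Picture_Won": "no",
    "Golden_Globes_Best_Drama_Won": "yes",
    "BAFTA_Best_Film_Won": "yes",
    "SAG_Best_Cast_Won": "no",
}

enriched = add_other_award_wins(example)
print(enriched["other_award_wins"])
```

Oscar fields are left out of the count on purpose, so the new feature can’t leak the outcome you’re trying to predict.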
What are you waiting for? Join in the fun, impress some friends, and let us know how your predictions turn out with a shoutout to @bigmlcom on Twitter!