A few months ago, my friends got into an argument over whether Avengers: Infinity War is a fantasy or a science-fiction film. IMDb lists the movie under both genres, which turned the debate into a question of whether the film is more fantasy or more sci-fi.
IMDb will tell you what genre any movie falls under, but not to what extent. That is, we can agree Inception and Star Trek are both sci-fi films, but who’s to say which is more sci-fi? And no, not which film is more futuristic or extraterrestrial - rather, which film better represents science-fiction as a genre and more strongly captures the elements commonly seen throughout the category. If you were to define “sci-fi” to a young child, which film would be the better example?
Up until now, there’s been no way to quantify movie genre classifications. A movie is either a fantasy, or it isn’t. A movie is either a horror, or it isn’t. To get my friends to shut up about Avengers, that needed to change.
Ultimately, Avengers: Infinity War is much more sci-fi than it is fantasy. Having that knowledge, let’s explore the rest of Hollywood’s most popular hits; with data on 2,166 other films, there are a lot of interesting genre compositions to see (more about the methodology below). Using the distance formula on each film’s genre scores, we can also see which films have similar scores.
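To make the "similar scores" idea concrete, here is a minimal sketch that assumes "the distance formula" means plain Euclidean distance between two films' seven genre scores. The film names and numbers below are made up for illustration, and `most_similar` is a hypothetical helper, not something taken from the project's notebook.

```python
import math

# Order of the seven genre scores in each vector.
GENRES = ["action", "comedy", "drama", "fantasy", "horror", "sci-fi", "thriller"]

# Hypothetical genre scores (percent of each film's total); not the article's data.
films = {
    "Film A": [40, 5, 10, 10, 0, 30, 5],
    "Film B": [35, 10, 10, 5, 0, 35, 5],
    "Film C": [0, 50, 40, 5, 0, 0, 5],
}

def genre_distance(a, b):
    """Euclidean distance between two genre score vectors."""
    return math.dist(a, b)

def most_similar(title, films):
    """Return (other_title, distance) pairs, nearest film first."""
    target = films[title]
    pairs = [
        (other, genre_distance(target, scores))
        for other, scores in films.items()
        if other != title
    ]
    return sorted(pairs, key=lambda pair: pair[1])

for other, distance in most_similar("Film A", films):
    print(f"{other}: {distance:.1f}")
```

A smaller distance means two films split their scores across the genres in a similar way. The same function could rank every film against a hand-picked target mix instead of against another film.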
Interested in watching a movie that’s the perfect combination of all your favorite genres? Use the sliders below to find what best matches what you’re looking for. What would your favorite movie resemble with half the action? Or twice the fantasy? Remember that each genre is a part of a whole, so your total across all seven groups cannot exceed 100%.
Finally, we can compare all 2,167 films to see which movies have the highest thriller scores, the highest drama scores, etc. A movie needs to contain dozens of genre-specific elements and tropes to reach the top, meaning these films are arguably their genres’ most defining works. As a caveat, “most horror” or “most comedy” does not necessarily mean scariest or funniest - just the best examples of what we usually see throughout the overall genre.
Methodology
When reviewing a movie on IMDb, users can submit plot keywords to give the film a high-level description. I scraped these words for any movie with at least 50,000 user reviews and 150 submitted keywords. This resulted in data on 2,167 movies - only including films listed in the website’s action, comedy, drama, fantasy, horror, sci-fi, or thriller sections.
I then determined how often each keyword was used to describe films in each genre category (excluding keywords that were used to describe fewer than 2% of movies in the dataset, which removed overly specific tags like “klingon” or “hogwarts”).
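The counting step can be sketched roughly as follows. This is an illustration under assumptions, not the project's actual code: the `movies` records, the `keyword_usage_by_genre` function, and its `min_share` parameter are all hypothetical names, and the toy data stands in for the scraped keywords.

```python
from collections import Counter, defaultdict

# Hypothetical toy records: each film has a set of genres and a set of keywords.
movies = [
    {"genres": {"action", "sci-fi"}, "keywords": {"explosion", "murder", "klingon"}},
    {"genres": {"action"}, "keywords": {"explosion", "pistol"}},
    {"genres": {"drama"}, "keywords": {"murder", "death"}},
    {"genres": {"horror"}, "keywords": {"murder", "blood"}},
]

def keyword_usage_by_genre(movies, min_share=0.02):
    """Return {genre: {keyword: share of that genre's films tagged with it}}."""
    total = len(movies)

    # Drop keywords that describe less than min_share of all films.
    overall = Counter(kw for m in movies for kw in m["keywords"])
    kept = {kw for kw, n in overall.items() if n / total >= min_share}

    # Count how many films in each genre carry each surviving keyword.
    genre_sizes = Counter(g for m in movies for g in m["genres"])
    counts = defaultdict(Counter)
    for m in movies:
        for g in m["genres"]:
            counts[g].update(m["keywords"] & kept)

    return {
        g: {kw: n / genre_sizes[g] for kw, n in c.items()}
        for g, c in counts.items()
    }

# With only four toy films a 2% cutoff removes nothing, so use 50% to show the filter.
usage = keyword_usage_by_genre(movies, min_share=0.5)
print(usage["action"])
```

In the toy data, "klingon" appears in only one of four films, so the 50% cutoff removes it, much as the 2% cutoff would remove rare tags in a dataset of a couple thousand films.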
Most Common Words Used to Describe Each Genre
Rank | Action | Comedy | Drama | Fantasy | Horror | Sci-Fi | Thriller |
1 | Murder | Friendship | Death | Death | Blood | Explosion | Murder |
2 | Explosion | Title Spoken By Character | Husband Wife Relationship | No Opening Credits | Murder | Death | Death |
3 | Violence | Father Son Relationship | Murder | Good Versus Evil | Death | Murder | Blood |
4 | Pistol | Kiss | Blood | Rescue | Surprise Ending | Violence | Violence |
5 | Death | Husband Wife Relationship | Father Son Relationship | Flashback | Violence | Chase | Pistol |
This ultimately did little to showcase unique features of genres and to distinguish them from one another, as most movie tropes are popular enough to span across all categories. “Murder,” for example, was used to describe more action movies than any other keyword, but it’s also a common occurrence in dramas, horrors, sci-fis, and thrillers. To uniquely describe “action,” we need the keywords which often describe action films but rarely describe films in other genres.
Most Genre-Specific Keywords
Rank | Action | Comedy | Drama | Fantasy | Horror | Sci-Fi | Thriller |
1 | Terrorist Plot | Slapstick Comedy | Based On True Story | Witch | Slasher | Space Travel | Police Corruption |
2 | Semiautomatic Pistol | Flatulence | 1940s | Magic | Supernatural Horror | Spacecraft | Die Hard Scenario |
3 | One Man Army | Masturbation | "Where Are They Now?" Epilogue | Princess | Body Count | Planet | Espionage |
4 | Female Assassin | Dating | Depression | King | Serial Killer | Future | Neo Noir |
5 | Espionage | On The Road | Drug Use | Queen | Darkness | Alien | Silencer |
The most genre-specific keywords were determined by calculating the relative usage of each keyword for each genre, which measures how unique a keyword is to that genre. For example, “underwater scene” is used to describe action films 1.69x more often than average, sci-fi films 1.43x more often, and fantasy films 1.42x more often, while every other genre uses it less often than average. So if we know a film has an underwater scene, we can presume it is more likely an action, sci-fi, or fantasy film than a horror, drama, or comedy.
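One plausible reading of "relative usage" is each genre's usage rate for a keyword divided by the average rate across all seven genres. The sketch below assumes that definition, which the article does not spell out. The rates are invented for illustration and `relative_usage` is a hypothetical helper.

```python
# Hypothetical share of each genre's films tagged "underwater scene".
# These rates are made up; only the idea of comparing to the average is from the text.
underwater_rates = {
    "action": 0.12,
    "sci-fi": 0.10,
    "fantasy": 0.10,
    "horror": 0.05,
    "thriller": 0.05,
    "drama": 0.04,
    "comedy": 0.04,
}

def relative_usage(rates):
    """Divide each genre's usage rate by the mean rate across genres.

    Values above 1.0 mean the keyword is used more often than average
    in that genre; values below 1.0 mean less often than average.
    """
    mean_rate = sum(rates.values()) / len(rates)
    if mean_rate == 0:
        return {genre: 0.0 for genre in rates}
    return {genre: rate / mean_rate for genre, rate in rates.items()}

for genre, ratio in relative_usage(underwater_rates).items():
    print(f"{genre}: {ratio:.2f}x")
```

Under this definition the ratios always average out to 1.0 across the genres, so a keyword that is equally common everywhere scores 1.0 in every genre and tells us nothing about which genre a film belongs to.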
Finally, each movie’s genre scores were calculated by summing the relative usage scores of its keywords. For example, any movie tagged with the “underwater scene” keyword had its action, sci-fi, and fantasy scores slightly incremented. The same process was applied to every keyword for which relative usage scores were calculated. You can check out the data and .ipynb on GitHub.
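The exact scoring rule lives in the notebook, so the following is only a guess at its shape. It assumes a keyword adds its excess over 1.0 to each genre where it is used more than average, and that the totals are then rescaled to sum to 100%. The three "underwater scene" values above 1.0 come from the text; every other number, and the `genre_scores` function itself, is hypothetical.

```python
GENRES = ["action", "comedy", "drama", "fantasy", "horror", "sci-fi", "thriller"]

# Relative usage per keyword and genre. The 1.69, 1.43 and 1.42 figures are
# quoted in the text; all remaining values are invented for this sketch.
relative = {
    "underwater scene": {
        "action": 1.69, "sci-fi": 1.43, "fantasy": 1.42,
        "comedy": 0.60, "drama": 0.50, "horror": 0.70, "thriller": 0.66,
    },
    "space travel": {
        "action": 1.10, "sci-fi": 4.00, "fantasy": 0.80,
        "comedy": 0.30, "drama": 0.20, "horror": 0.40, "thriller": 0.20,
    },
}

def genre_scores(keywords, relative):
    """Sum above-average relative usage per genre, then rescale to percentages."""
    totals = {genre: 0.0 for genre in GENRES}
    for keyword in keywords:
        # Keywords without relative usage scores contribute nothing.
        for genre, ratio in relative.get(keyword, {}).items():
            if ratio > 1.0:  # assumption: only above-average usage adds to a score
                totals[genre] += ratio - 1.0
    total = sum(totals.values())
    if total == 0:
        return totals
    return {genre: 100 * value / total for genre, value in totals.items()}

scores = genre_scores(["underwater scene", "space travel"], relative)
for genre, share in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{genre}: {share:.1f}%")
```

Rescaling to 100% matches the idea that each genre is a part of a whole. Other choices, such as adding the full ratio instead of the excess over 1.0, would produce different scores.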
Footnotes
- For the purposes of this project, I did not include romance as a genre, for two reasons:
  - I initially did include romance, only to find it correlated far too strongly with drama. This meant the data often failed to distinguish between the two genres, leading to lower relative usage scores and cannibalizing each other’s final genre scores. A similar phenomenon happened between action and adventure when adventure was included as its own genre.
  - Romance is far more binary than action, fantasy, horror, etc. There is either a love story or there isn’t, making it pretty black and white. Compare this to action: a film can have a lot of action, absolutely no action, or anything in between. Romance is usually either there or it isn’t, with little room in the middle. Therefore, the romance genre doesn’t lend itself to gradient measurement very well.
- The radar charts were created using a library written by Nadieh Bremer and updated by Matthieu Viry.
- The sliders were created using a library written by John Walley.