
Anatomy of an Analysis (Part 2) – The Enrichening

In the first part of this analysis, I turned a short list of movies into a database that could be used to answer basic questions about the list’s contents. Now I’d like to broaden this analysis by combining the original list with additional outside information — a process called data enrichment.

First, I needed to find and process a new set of data. I chose the Best Movies of All Time list compiled by the popular film review aggregator Rotten Tomatoes because I thought it might include movies that were more popular with a general audience. The RT list ranks movies by their adjusted Tomatometer rating (as of mid-August 2015) and pulls out the top 100. I copied this list into a spreadsheet and created fields for Rank, Film, Year, and Decade.

Once this information was ready, I used the name of the movie itself to join the RT list to the original BBC list. This approach, while perfectly reasonable, carries some risk because the two sources do not always spell a title the same way. When that happens, you have to match the information by hand. Can you spot the problem in each pair of names below?

Best Movies (RT) | Greatest American Films (BBC)
Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb | Dr Strangelove
E.T. the Extra-Terrestrial | ET: The Extra-Terrestrial
It’s a Wonderful Life | It’s a Wonderful Life
One Flew Over the Cuckoo’s Nest | One Flew Over the Cuckoo’s Nest
Schindler’s List | Schindler’s List
The Godfather Part II | The Godfather Part II

The first mismatch is pretty obvious because Rotten Tomatoes includes the full subtitle of Dr. Strangelove in the title while the BBC does not. However, there are also some subtle differences in punctuation (such as the period after the abbreviation of “doctor” in the first column) that would still cause problems during a join. These punctuation issues show up more clearly in the second item, which differs in both the abbreviation of “E.T.” and the colon in the BBC version of the title. It gets more subtle from there!

The next three movie titles all contain a contraction or a possessive noun, but one source uses an apostrophe while the other uses a single quotation mark. (To make this problem even harder to spot, some web browsers render them both the same. Check the page source.) Finally, the last paired items look identical … except that the first listing of The Godfather Part II includes a trailing space. Pretty esoteric, I know, but that is life in the data world.
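
The mismatches described above (punctuation, apostrophe variants, trailing spaces) can often be reduced before the join by comparing cleaned-up keys instead of raw titles. The following is a minimal Python sketch of that idea, not the spreadsheet method used for this analysis. The names `normalize_title`, `match_key`, and `MANUAL_ALIASES` are hypothetical, and the alias table stands in for the matching that still has to be done by hand.

```python
import re
import unicodedata

# Apostrophe-like characters that sources use interchangeably (assumed set).
APOSTROPHE_VARIANTS = "\u2018\u2019\u02bc"

# Hypothetical hand-made overrides for titles that cleaning alone cannot align.
MANUAL_ALIASES = {
    "dr strangelove or how i learned to stop worrying and love the bomb":
        "dr strangelove",
}


def normalize_title(title: str) -> str:
    """Reduce a film title to a comparison key (not a display value)."""
    # Map curly and modifier apostrophes to the plain ASCII apostrophe.
    key = title.translate({ord(ch): "'" for ch in APOSTROPHE_VARIANTS})
    # Fold compatibility characters (e.g. non-breaking spaces) and case.
    key = unicodedata.normalize("NFKC", key).casefold()
    # Drop remaining punctuation such as periods, colons, and hyphens.
    key = re.sub(r"[^\w\s']", "", key)
    # Collapse runs of whitespace and trim leading/trailing spaces.
    return " ".join(key.split())


def match_key(title: str) -> str:
    """Normalized key, with manual aliases applied last."""
    key = normalize_title(title)
    return MANUAL_ALIASES.get(key, key)
```

With this in place, `match_key("E.T. the Extra-Terrestrial")` and `match_key("ET: The Extra-Terrestrial")` produce the same key, while the original titles stay untouched for display.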

With the two data sources aligned, I then created my final enhanced database and explored the information using another pivot table.

The first thing I noticed when I compared the BBC list to the Rotten Tomatoes list is that they had only 22 films in common. This surprised me a little at first, but it makes sense when you realize that the RT list is not limited to American films. It also seemed to support my initial instinct that the RT database would contain many more recent films due to its online format.
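
Finding the films the two lists share is essentially a join on the title. Here is a rough Python sketch of that comparison, assuming the titles have already been aligned apart from case and stray whitespace. The function names are hypothetical, and the two sample lists are tiny excerpts taken from titles mentioned in this post, so a title missing from an excerpt says nothing about the full list.

```python
def title_key(title: str) -> str:
    # Minimal key for this sketch: trim whitespace and ignore case.
    return title.strip().casefold()


def compare_lists(bbc_titles, rt_titles):
    """Split titles into: on both lists, only on BBC, only on RT."""
    bbc = {title_key(t): t for t in bbc_titles}
    rt = {title_key(t): t for t in rt_titles}
    common = [bbc[k] for k in bbc if k in rt]
    only_bbc = [bbc[k] for k in bbc if k not in rt]
    only_rt = [rt[k] for k in rt if k not in bbc]
    return common, only_bbc, only_rt


# Illustrative excerpts only, not the full 100-film lists.
bbc_sample = ["Citizen Kane", "The Godfather", "Vertigo", "The Godfather Part II"]
rt_sample = ["Citizen Kane", "The Godfather", "The Godfather Part II ", "Mad Max: Fury Road"]

common, only_bbc, only_rt = compare_lists(bbc_sample, rt_sample)
print(len(common), "titles appear in both excerpts")
```

Keeping the unmatched titles from each side is a cheap way to spot names that should have matched but did not.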

TOP DECADES IN FILM (BBC vs. Rotten Tomatoes)

A quick look at the films by source and decade (above) shows a huge number of recent films in the RT listing (including one, Mad Max: Fury Road, that was still in theaters when I first downloaded the data). It is also interesting to note that the spike in “best” movies for Rotten Tomatoes occurs in the 1950s instead of the 1970s. Many of the RT films from the 1950s are foreign, however, which leads quickly to discussions of Japan’s “Golden Age” of cinema during that period.


Another interesting view of this information can be seen when you compare the two ranked lists side by side. The chart above shows the 22 films that appear on both lists with a line connecting their two ranks. This makes it easy to see where the sources agree and where they disagree. Several of the critical darlings (Citizen Kane, The Godfather, Singin’ in the Rain, and North by Northwest) also rank high on the RT list, while others (many of them from the American New Wave period of the 1970s) show a drop in popularity. Meanwhile, classically popular films like The Wizard of Oz and ET: The Extra-Terrestrial float upward.

Anatomy of an Analysis (Part 1)

A few weeks ago, BBC News produced a list of the 100 greatest American films based on input from critics around the world.

Here are the top ten films presented in rank order:

  1. Citizen Kane (Orson Welles, 1941)
  2. The Godfather (Francis Ford Coppola, 1972)
  3. Vertigo (Alfred Hitchcock, 1958)
  4. 2001: A Space Odyssey (Stanley Kubrick, 1968)
  5. The Searchers (John Ford, 1956)
  6. Sunrise (FW Murnau, 1927)
  7. Singin’ in the Rain (Stanley Donen and Gene Kelly, 1952)
  8. Psycho (Alfred Hitchcock, 1960)
  9. Casablanca (Michael Curtiz, 1942)
  10. The Godfather Part II (Francis Ford Coppola, 1974)

There is really nothing too surprising here. Perennial favorite Citizen Kane tops the list followed by The Godfather and Vertigo — two of the most famous films (by two of the most famous directors) ever produced. Perusing the full list, you might recognize a few other titles and maybe think about adding some of them to your Netflix queue. But that’s about it. Aside from a handful of ancillary stories, there was little additional commentary to draw you deeper into the story. Sensing an opportunity, I decided to use this list to demonstrate the steps involved in a quick and simple analysis of data found “in the wild.”

Here follows a demonstration of my 5-step program for data analysis:


The BBC asked each critic to submit a list of the ten films they felt were the greatest in American cinema (“… not necessarily the most important, just the best.”). For the project, an “American film” was defined as any movie that received funding from a U.S. source. This criterion included many films by foreign directors as well as films shot outside of the country. The highest-ranking film on each list received ten points, the next film down received nine points, and so on, until the tenth pick received one point. All the points were then tallied to produce the final list.
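
The scoring scheme described above (ten points for a critic's first pick, down to one point for the tenth) can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name `tally_ballots` and the two shortened ballots are hypothetical, and the BBC's handling of ties is not described in the article, so none is assumed here.

```python
from collections import Counter


def tally_ballots(ballots, max_points=10):
    """Score ranked ballots: first pick gets max_points, next one fewer, etc."""
    scores = Counter()
    for ballot in ballots:
        # Only the first max_points picks on a ballot can earn points.
        for position, film in enumerate(ballot[:max_points]):
            scores[film] += max_points - position
    # Highest total first; equal totals keep their first-seen order.
    return scores.most_common()


# Hypothetical, shortened ballots used purely for illustration.
sample_ballots = [
    ["Citizen Kane", "Vertigo", "The Godfather"],
    ["The Godfather", "Citizen Kane", "Psycho"],
]

ranking = tally_ballots(sample_ballots)
for film, points in ranking:
    print(f"{points:>3}  {film}")
```

In this made-up example Citizen Kane collects 10 + 9 = 19 points and The Godfather 8 + 10 = 18, so a film that is everyone's second choice can outrank a film that is one critic's favorite.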


Even though the resulting “listicle” is fairly simple, it contains a lot of interesting information just waiting to be freed from its icy confines. I pulled the list into Excel and used some very basic string (text) manipulation to create four basic fields from each row of information:

Additional manipulation of the “Year” field yields a useful grouping category:

With the creation of these five fields, I now have a flexible database instead of a rigid list.
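
For readers who prefer code to spreadsheet formulas, the same kind of text manipulation can be sketched in Python with a regular expression. This is not the Excel approach used here, just an equivalent under stated assumptions: rows look like "1. Citizen Kane (Orson Welles, 1941)", and the field names Rank, Film, Director, Year, and Decade are my guess at a reasonable layout. `ROW_PATTERN` and `parse_row` are hypothetical names.

```python
import re

# Assumed row shape: "<rank>. <film> (<director>, <year>)"
ROW_PATTERN = re.compile(
    r"^\s*(?P<rank>\d+)\.\s+(?P<film>.+)\s+"
    r"\((?P<director>.+),\s*(?P<year>\d{4})\)\s*$"
)


def parse_row(row: str) -> dict:
    """Split one list row into separate fields and derive the decade."""
    match = ROW_PATTERN.match(row)
    if match is None:
        raise ValueError(f"Unrecognized row: {row!r}")
    year = int(match["year"])
    return {
        "Rank": int(match["rank"]),
        "Film": match["film"],
        "Director": match["director"],
        "Year": year,
        # Grouping category derived from the year, e.g. 1968 -> "1960s".
        "Decade": f"{year // 10 * 10}s",
    }


record = parse_row("4. 2001: A Space Odyssey (Stanley Kubrick, 1968)")
print(record)
```

Raising an error on rows that do not fit the pattern is deliberate: with data found in the wild, a silent mis-parse is harder to notice than a loud failure.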


The final data set was “stored” as a table in a simple spreadsheet. Although I have many reservations about using Excel for data storage (more on that in a future post), it is a quick and easy way to organize small sets of data.


Once the data was in the format I wanted, I created a pivot table that allowed me to manipulate information in different ways. I was particularly interested in answering questions like “Who are the top directors?”, “When were most of these films made?”, and “Was there ever a ‘Golden Age’ of modern cinema?” Most of these questions can be answered through simple grouping and summarization.
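
The grouping and summarization a pivot table performs can be approximated in Python with a counter. The sketch below uses only the ten films listed earlier in this post, so its counts describe the top ten rather than the full list of 100. The helper name `count_by` is hypothetical.

```python
from collections import Counter

# The BBC top ten as listed above (film, director, year).
films = [
    {"Film": "Citizen Kane", "Director": "Orson Welles", "Year": 1941},
    {"Film": "The Godfather", "Director": "Francis Ford Coppola", "Year": 1972},
    {"Film": "Vertigo", "Director": "Alfred Hitchcock", "Year": 1958},
    {"Film": "2001: A Space Odyssey", "Director": "Stanley Kubrick", "Year": 1968},
    {"Film": "The Searchers", "Director": "John Ford", "Year": 1956},
    {"Film": "Sunrise", "Director": "FW Murnau", "Year": 1927},
    {"Film": "Singin' in the Rain", "Director": "Stanley Donen and Gene Kelly", "Year": 1952},
    {"Film": "Psycho", "Director": "Alfred Hitchcock", "Year": 1960},
    {"Film": "Casablanca", "Director": "Michael Curtiz", "Year": 1942},
    {"Film": "The Godfather Part II", "Director": "Francis Ford Coppola", "Year": 1974},
]


def count_by(records, key_func):
    """Count records per group, largest group first (like a pivot table)."""
    return Counter(key_func(r) for r in records).most_common()


by_director = count_by(films, lambda r: r["Director"])
by_decade = count_by(films, lambda r: f"{r['Year'] // 10 * 10}s")

print(by_director[:2])
print(by_decade)
```

Passing the grouping rule in as a function keeps the summary step the same whether the question is about directors, years, or decades.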


After all that work, it’s time to pull together the results and display them in some way. For this exercise, that means a few simple tables and charts:


Director | # of Films in the Top 100
Stanley Kubrick | 5
Steven Spielberg | 5
Alfred Hitchcock | 5
Billy Wilder | 5
Francis Ford Coppola | 4
Howard Hawks | 4
Martin Scorsese | 4
John Ford | 3
Orson Welles | 3
Charlie Chaplin | 3


Year | # of Films in the Top 100
1975 | 5
1980 | 4
1974 | 4
1959 | 4
1939 | 3
1941 | 3
1977 | 3
1946 | 3
1994 | 3


These simple presentation tools start to tell some interesting stories and, like all good analysis tools, hint at additional avenues of exploration. For example, while two of the directors with five films in the Top 100 (Kubrick, Hitchcock) also made it into the Top 10, the other two (Spielberg and Wilder) did not … why? The year with the most films on the list was 1975 … what were the films? The 1970s account for over 20% of the films on the list … what was going on in the culture that led to this flowering of expression?

It would have been really great if the BBC article had included some sort of interactive tool that allowed readers to explore the database themselves. I will see what I can do to tackle this in an upcoming post.