Tag Archives: Charts

Anatomy of an Analysis (Part 2) – The Enrichening

In the first part of this analysis, I turned a short list of movies into a database that could be used to answer basic questions about the list’s contents. Now I’d like to broaden this analysis by combining the original list with additional outside information — a process called data enrichment.

First, I needed to find and process a new set of data. In this case, I chose a list of the Best Movies of All Time compiled by popular film review aggregator, Rotten Tomatoes because I thought it might include movies that were more popular with a general audience. The RT list ranks movies by their adjusted Tomatometer rating (as of mid-August 2015) and pulls out the top 100. I copied this list over to a spreadsheet and created fields for Rank, Film, Year, and Decade.

Once this information was ready, I used the name of the movie itself to join the RT list to the original BBC list. This approach, while perfectly reasonable, does come with a certain level of risk because the two sources do not always match perfectly. When that happens you have to match the information by hand. Can you spot the problems associated with each pair of names below?

Best Movies (RT) Greatest American Films (BBC)
Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb Dr Strangelove
E.T. the Extra-Terrestrial ET: The Extra-Terrestrial
It’s a Wonderful Life It’s a Wonderful Life
One Flew Over the Cuckoo’s Nest One Flew Over the Cuckoo’s Nest
Schindler’s List Schindler’s List
The Godfather Part II The Godfather Part II

The first mismatch is pretty obvious because Rotten Tomatoes includes the full tagline for the movie Dr. Strangelove in the title while the BBC does not. However, there are also some subtle differences in punctuation (such as the period after the abbreviation of “doctor” in the first column) that would still cause problems during a join. These punctuation issues show up more clearly with the second item which has differences in both the abbreviation of “E.T.” and the inclusion of a colon in the BBC version of the title. It gets more subtle from there!

The next three movie titles all contain a contraction or a possessive noun but one source uses an apostrophe while the other uses a single quotation mark. (To make this problem even harder to spot, some web browsers render them both the same. Check the page source.). Finally, the last paired items look identical … except that the first listing of The Godfather Part II includes a trailing space. Pretty esoteric, I know, but that is life in the data world.

With the two data sources aligned, I then created my final enhanced database and explored the information using another pivot table.
The first thing I noticed when I compared the BBC list to the Rotten Tomatoes list is that they only had 22 films in common. This surprised me a little at first but it makes sense when you realize that the RT list is not limited to American films. It also seemed to support my initial instinct that the RT database would contain many more recent films due to its online format.

TOP DECADES IN FILM (BBC vs. Rotten Tomatoes)

A quick look at the films by source and decade (above) shows a huge number of recent films in the RT listing (including one, Mad Max: Fury Road, that was still in theaters when I first downloaded the data). It is also interesting to note that the spike in “best” movies for Rotten Tomatoes occurs in the 1950s instead of the 1970s. However, the large number of foreign films in the RT list for the 1950s leads quickly to discussions of Japan’s “Golden Age” of cinema during that time period.


Another interesting view of this information can be seen when you compare the two ranked lists side-by-side. The chart above shows the 22 films that appear on both lists with a line connecting their two ranks. This makes it easy to see where the sources agree and where they disagree. Several of the critical darlings (Citizen Kane, The Godfather, Singin’ in the Rain, and North by Northwest) also rank high on the RT list while others (many of them from the American New Wave period of the 1970s) show a drop in popularity. Meanwhile, other classically popular films like The Wizard of Oz and ET: The Extra-Terrestrial float upward.

Anatomy of an Analysis (Part 1)

A few weeks ago, the BBC News produced a list of the top 100 greatest American films based on input from critics from around the world.

Here are the top ten films presented in rank order:

  1. Citizen Kane (Orson Welles, 1941)
  2. The Godfather (Francis Ford Coppola, 1972)
  3. Vertigo (Alfred Hitchcock, 1958)
  4. 2001: A Space Odyssey (Stanley Kubrick, 1968)
  5. The Searchers (John Ford, 1956)
  6. Sunrise (FW Murnau, 1927)
  7. Singin’ in the Rain (Stanley Donen and Gene Kelly, 1952)
  8. Psycho (Alfred Hitchcock, 1960)
  9. Casablanca (Michael Curtiz, 1942)
  10. The Godfather Part II (Francis Ford Coppola, 1974)

There is really nothing too surprising here. Perennial favorite Citizen Kane tops the list followed by The Godfather and Vertigo — two of the most famous films (by two of the most famous directors) ever produced. Perusing the full list, you might recognize a few other titles and maybe think about adding some of them to your Netflix queue. But that’s about it. Aside from a handful of ancillary stories, there was little additional commentary to draw you deeper into the story. Sensing an opportunity, I decided to use this list to demonstrate the steps involved in a quick and simple analysis of data found “in the wild.”

Here follows a demonstration of my 5-step program for data analysis:


The BBC asked each critic to submit a list of the ten films they felt were the greatest in American cinema (“… not necessarily the most important, just the best.”). For the project, an “American film” was defined as any movie that received funding from a U.S. source. This criteria included many films by foreign directors as well as films shot outside of the country. The highest ranking films on each list received ten points, the next film down received nine points, and so on. The tenth pick received one point. All the points were then tallied to produce the final list.


Even though the resulting “listicle” is fairly simple, it contains a lot of interesting information just waiting to be freed from its icy confines. I pulled the list into Excel and used some very basic string (text) manipulation to create four basic fields from each row of information:

Additional manipulation of the “Year” field yields a useful grouping category:

With the creation of these five fields, I now have a flexible database instead of a rigid list.


The final data set was “stored” as a table in a simple spreadsheet. Although I have many problems using Excel for data storage (more on that in a future post), it is a quick and easy way to organize small sets of data.


Once the data was in the format I wanted, I created a pivot table that allowed me to manipulate information in different ways. I was particularly interested in answering questions like “Who are the top directors?”, “When were most of these films made?”, and “Was there ever a ‘Golden Age’ of modern cinema?” Most of these questions can be answered through simple grouping and summarization.


After all that work, it’s time to pull together the results and display them in some way. For this exercise, that means a few simple tables and charts:


Director # of Films in the Top 100
Stanley Kubrick 5
Steven Spielberg 5
Alfred Hitchcock 5
Billy Wilder 5
Francis Ford Coppola 4
Howard Hawks 4
Martin Scorsese 4
John Ford 3
Orson Welles 3
Charlie Chaplin 3


Year # of Films in the Top 100
1975 5
1980 4
1974 4
1959 4
1939 3
1941 3
1977 3
1946 3
1994 3


These simple presentation tools start to tell some interesting stories and — like all good analysis tools — start to hint at additional avenues of exploration. For example, while two of the directors with five films in the Top 100 (Kubrick, Hitchcock) also made it into the Top 10, the other two (Spielberg and Wilder) did not … why? The year with the most films on the list was 1975 … what were the films? The 1970s account for over 20% of the films on the list … what was going on in the culture that lead to this flowering of expression?

It would have been really great if the BBC article had included some sort of interactive tool that allowed readers to explore the database themselves. I will see what I can do to tackle this in an upcoming post.

Trends in NFL Football Scores (Part 1)

One of the goals I set for myself this summer was to learn a bit about D3, a visualization toolkit that can be used to manipulate and display data on the web. Considering that the trees are bare and we’ve already had our first frost here in Wisconsin, you can safely assume that I am behind schedule. Nevertheless, I feel that I’ve finally reached a point where I have something to publish, so here goes.

First of all, a little background. D3 is a JavaScript library that allows you to bind data to any of the elements (text, lines and shapes) you might normally find on a web page.  These objects can be stylized using CSS and animated using simple dynamic functions. These features make D3 a perfect tool for creating interactive charts and graphs without having to depend on third party programs like Google Charts, Many Eyes or Tableau.

I wanted to start out with something simple so I elected to go with a basic line chart using data I pulled from Pro-Fooball-Reference.com. This site contains a ton of great information and statistics from the past 90+ years of the National Football League but — for now — I just looked at the final scores of all the games played from 1920 to 2011. My first D3-powered chart is below. It shows the average combined scores of winning and losing teams for each year of the NFL’s existence.

Although this chart looks pretty simple, every element — including titles, subtitles, axes, labels, grids and data lines — has been created manually using the D3 code. The payoff is pretty nice. All of the elements can be reused and you have tremendous control over what is shown onscreen. To demonstrate some of these cababilities, I’ve added interactive overlays that show a few of the major eras in NFL football (derived from work of David Neft and this discussion thread). If you move your mouse over the graph, you will see these different eras highlighted:

Early NFL (1920-1933) – The formation of the American Professional Football Association (APFA) in 1920 marked the official start of what was to become the National Football League. This era was marked by rapid formation (and dissolution) of small town franchises, vast differences in team capabilities and a focus on a relatively low-scoring running game. At this time, the pass was considered more of an emergency option than a reliable standard. The rapid growth in popularity of the NFL during this era culminated with the introduction of a championship game in 1932.

Introduction of the Forward Pass (1933-1945) – The NFL discontinued the use of collegiate football rules in 1933 and began to develop its own set of rules designed around a faster-paced, higher-scoring style of play. These innovations included the legalization of the forward pass from anywhere behind the line of scrimmage — a change that is often called the  “Bronko Nagurski Rule” after his controversial touchdown in the 1932 NFL Playoff Game.

Post-War Era (1945-1959) – The end of WWII saw the expansion of the NFL beyond its East Coast and Midwestern roots with the move of the Cleveland Rams to Los Angeles — the first big-league sports franchise on the West Coast. This period also saw the end of racial segregation (enacted in the 30s) and the start of nationally televised games.

Introduction of the AFL (1959-1966) – Professional football’s surge in popularity led to the formation of a rival organization — the American Football League — in 1960. The growth of the flashy AFL was balanced by a more conservative style of play in the NFL. This style was epitomized by coach Vince Lombardi and the Green Bay Packers, who would win five championships in the 1960s. In 1966, the two leagues agreed to merge as of the 1970 season.

Dead Ball Era (1966-1977) – Driven in part by stringent restrictions on the offensive line, this period is marked by low scores and tough defensive play. Teams that thrived in this environment include some of the most famous defenses in modern NFL history: Pittsburgh’s Steel Curtain, Dallas’ Doomsday Defense, Minnesota’s Purple People Eaters and the Rams’ Fearsome Foursome.

Live Ball Era (1978-present) – Frustrated by the decreasing ability of offenses to score points in 70s, the NFL began to add rules and make other changes to the structure of the game in an attempt to boost scoring. The most famous of these initiatives was the so-called “Mel Blount Rule” (introduced in 1978), which severely restricted the defense’s ability to interfere with passing routes. With the subsequent introduction of the West Coast Offense in 1979 — an offense based on precise, short passes — this period became marked by a major focus on the passing game.

Having created this first chart, I decided to build a second chart based on the ratio of average winning scores to average losing scores to see if there were any patterns.

The chart above shows how — after a period of incredibly lopsided victories — the average scoring differential settled in to a very steady pattern by the late 1940s and stayed at that level (roughly 2:1) for the next 30 years. Despite many changes in rules, coaching techniques, technology and other factors, only the pass interference rules of the late 1970s seemed to have any signifcant effect on this ratio, shifting it to just under 1.8:1 for the next 30 years.

While I had the data available, I also decided to look at the differences in average scores between home teams and away teams. The chart below plots this data along with the same overlay I used in the first chart.

A look at the ratio of average home team scores to average away team scores follows:

What’s fascinating about this chart is how quickly a form of parity was acheived among all the NFL teams. By the mid-30s, a measurable home field advantage can be seen at roughly 15%, a rate that has remained essential constant for over 70 years. Factors for this boost could include the psychological support of fans, familiar weather conditions, unique features of local facilities, lack of travel fatigue, referee bias and/or increased levels of motivation in home town players.

Thanks to Charles Martin Reid for his solution to getting D3 and WordPress to play nice.

Family Pool Trends

With the start of a rather warm, dry summer here in Wisconsin, we’ve decided to take the plunge (literally) and purchase a new above ground pool. We seem to outgrow these things every few years and I’ve become intrigued with the idea that we just keep buying larger and large cylinders of water. After some exhaustive research (which mostly involved looking through a lot of old photographs and estimating pool sizes), I present you with a timeline of our family’s pool history. The bubbles represent surface area and allow for relative size comparison.

Pool History (1997-2012)

Bubble Size = Pool Size

It is interesting to see that — with the exception of a few strays — we seem to buy a new pool every three years. It is also interesting to see the exponential growth in water volume that began about the time my daughter was born (when my son was four). If we keep up that pace, our next pool will be over 10,000 cubic feet — about the size of two 18-wheelers full of water.

You Are What You Watch

Experian-Simmons released some survey data in December that looked at the relative popularity of major television shows for three different political groups: liberal Democrats; conservative Republicans; and middle-of-the-road voters. Each show was given an index based on the concentration of specific voters and this information was used to create lists of the top programs for each political group in both entertainment and news categories.

Although these top ten lists were interesting on their own, the fact that each individual TV program actually had an index rating for all three groups offers an opportunity for more complex analysis. The most obvious next step involves comparing pairs of groups in a 2D scatterplot chart. The Tableau visualization below shows the results.

A few notes:

  • Entertainment shows are in blue, news shows are in orange.
  • Shows without enough data for a particular group were still plotted as a zero index.
  • Hovering over each data point reveals the show and its indices.


The first thing I noticed was that news shows were much more partisan than entertainment shows. In fact, almost all of the shows with the most extreme scores were either news shows (primarily FOX and MSNBC) or fake news shows (Comedy Central’s Daily Show and Colbert Report). PBS gets a few high scores on the liberal side but the standard television networks are all fairly evenly watched.

Another thing that strikes me is how similar the watching habits of middle-of-the-road voters are to those of conservatives Republicans. The only noticeable exception occurs with news programs, but it is a pretty big exception: FOX News. All of the top ten conservative news programs were all on FOX while none of the top middle-of-the-road news programs were on that network. It might be encouraging for conservative politicians to see the similarities in entertainment interests between conservative voters and independents but I suspect that the gulf in news sources would be hard to overcome.

Many of the other differences have been noted elsewhere but are worth repeating: liberal Democrats tend to favor funnier shows and stories involving morally complex characters while conservative Republicans favor shows where people are doing stuff — either real work or reality competitions.

Of course, having complained about the lack of 2D analysis for this data in the major online outlets, I would be remiss if I didn’t point out the fact that each show has three indices apiece. Logically, we should be trying to show the data in a 3D scatterplot.

This isn’t as easy as it sounds since most of the major charting applications aren’t very good in 3D and they don’t provide any interactive option for the web that I could find. The best options seemed to be R or something called CanvasXpress — neither of which I had worked with before. I chose R, which allowed me to create both static and interactive 3D plots. However, only screenshots of the interactive plot are available at the moment. Several hours later …

What ‘The Office’ Gets Wrong About the Office

I start a new job next week and so I’ve been working on documenting all of my old tasks and projects in preparation for the transition. As I was going through old e-mails, I came across the introductory note my manager sent out to the department on my first day back in June 2004. Comparing it to the departure e-mail from my current manager, it’s amazing to see the changes in personnel over a seven-year span.

I prepared this chart using the distribution list from both e-mails, a drawing program, and a site that creates proportional venn diagrams. Only eight people are listed twice — including me and a person who left the company and has since returned. Some of the people who are only listed once have more tenure then me — they just may have gone to/come from another department. Still, it represents an interesting fact about the modern office. Change is constant.


Now that we can safely say that Brett Favre has retired (notwithstanding rumors to the contrary), I thought it was time to pull out some data on the indecisive quarterback’s career touchdown passes. Stats on passes say a lot about the relationship between a quarterback and his receivers so I wanted to create a visual that captured some of these stories.

The chart below shows each touchdown pass that Brett Favre threw during his NFL career and displays it up by receiver (vertical axis), season (horizontal axis), average yardage per month (size of marker), and team (color of marker).

Packer Fans will immediately recognize the significance of some of the data points. For the rest of you, here are a few highlights:

  • Sterling Sharpe caught Brett Favre’s first touchdown pass as a Green Bay Packer in 1992 and continued to be the quarterback’s primary receiver for the next three years. The 5x All-Pro led the NFL in touchdown receptions in both 1992 and 1994 and would certainly have played a major role in the team’s subsequent success if he hadn’t suffered a career-ending neck injury at the end of the 1994 season.
  • Following Sharpe’s early exit from football, Favre was forced to distribute his passes among a broader range of players, chief among them wide receivers Robert Brooks and Antonio Freeman. These two players would serve as the primary pillars of the passing game throughout Favre’s most successful period with Green Bay.
  • During the 1996 season (the year the Packers won Super Bowl XXXI), Favre threw touchdowns to ten different receivers, a career high. His total touchdown pass yardage that year also reached a high water mark.
  • Following Favre’s two Super Bowl appearances, there was a noticeable dropoff in the number of new players catching touchdowns. It is not clear whether it was because the receiving core had stabilized or the coaches were focused on developing other aspects of the team but there were no fresh faces in the 1998 season and only two (Corey Bradford and Donald Driver) in 1999.
  • Favre did not have another pair of favorite “big play” receivers until his last two seasons with the Packers, when he had both Driver and Greg Jennings.
  • After Favre’s retirement from the Packers, he was introduced to an entirely new slate of receivers with the New York Jets in 2008. This situation was repeated in 2009 when he signed up with the Minnesota Vikings. He threw his final touchdown pass to Percy Harvin in December 2010.