Anatomy of an Analysis (Part 2) – The Enrichening

In the first part of this analysis, I turned a short list of movies into a database that could be used to answer basic questions about the list’s contents. Now I’d like to broaden this analysis by combining the original list with additional outside information — a process called data enrichment.

First, I needed to find and process a new set of data. In this case, I chose a list of the Best Movies of All Time compiled by the popular film review aggregator Rotten Tomatoes because I thought it might include movies that were more popular with a general audience. The RT list ranks movies by their adjusted Tomatometer rating (as of mid-August 2015) and pulls out the top 100. I copied this list over to a spreadsheet and created fields for Rank, Film, Year, and Decade.

Once this information was ready, I used the name of the movie itself to join the RT list to the original BBC list. This approach, while perfectly reasonable, does come with a certain level of risk because the two sources do not always match perfectly. When that happens you have to match the information by hand. Can you spot the problems associated with each pair of names below?

Best Movies (RT) | Greatest American Films (BBC)
Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb | Dr Strangelove
E.T. the Extra-Terrestrial | ET: The Extra-Terrestrial
It’s a Wonderful Life | It’s a Wonderful Life
One Flew Over the Cuckoo’s Nest | One Flew Over the Cuckoo’s Nest
Schindler’s List | Schindler’s List
The Godfather Part II | The Godfather Part II

The first mismatch is pretty obvious because Rotten Tomatoes includes the full tagline for the movie Dr. Strangelove in the title, while the BBC does not. However, there are also some subtle differences in punctuation (such as the period after the abbreviation of “doctor” in the first column) that would still cause problems during a join. These punctuation issues show up more clearly with the second item, which differs in both the abbreviation of “E.T.” and the inclusion of a colon in the BBC version of the title. It gets more subtle from there!

The next three movie titles all contain a contraction or a possessive noun, but one source uses an apostrophe while the other uses a single quotation mark. (To make this problem even harder to spot, some web browsers render them both the same. Check the page source.) Finally, the last paired items look identical … except that the first listing of The Godfather Part II includes a trailing space. Pretty esoteric, I know, but that is life in the data world.
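Most of the mismatches described above can be caught by building a normalized join key instead of joining on the raw title. The following is only a sketch of that idea, not the spreadsheet process actually used here: the function name, the alias table, and the specific cleanup rules are assumptions, and the rules only handle ASCII titles.

```python
import re
import unicodedata

# Hand-made alias table for titles that no cleanup rule can reconcile.
# (Hypothetical: stands in for the "match by hand" step.)
MANUAL_ALIASES = {
    "dr strangelove or how i learned to stop worrying and love the bomb": "dr strangelove",
}


def normalize_title(title: str) -> str:
    """Build a join key that ignores case, punctuation, and stray spaces."""
    key = unicodedata.normalize("NFKC", title).lower()
    # Drop periods and every apostrophe variant (straight or curly).
    key = re.sub(r"[.'\u2018\u2019]", "", key)
    # Turn any other run of punctuation or whitespace into one space.
    key = re.sub(r"[^a-z0-9]+", " ", key)
    # Remove leading and trailing spaces left over from the source.
    key = key.strip()
    return MANUAL_ALIASES.get(key, key)


rt_titles = [
    "Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb",
    "E.T. the Extra-Terrestrial",
    "It's a Wonderful Life",
    "The Godfather Part II ",  # note the trailing space
]
bbc_titles = [
    "Dr Strangelove",
    "ET: The Extra-Terrestrial",
    "It\u2019s a Wonderful Life",  # curly quotation mark
    "The Godfather Part II",
]

# Titles whose keys appear in both lists are the ones that would join.
shared = {normalize_title(t) for t in rt_titles} & {normalize_title(t) for t in bbc_titles}
print(sorted(shared))
```

Keeping the original titles in their own column and joining on the key means the cleanup never destroys the source text.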

With the two data sources aligned, I then created my final enhanced database and explored the information using another pivot table.

The first thing I noticed when I compared the BBC list to the Rotten Tomatoes list is that they only had 22 films in common. This surprised me a little at first but it makes sense when you realize that the RT list is not limited to American films. It also seemed to support my initial instinct that the RT database would contain many more recent films due to its online format.

TOP DECADES IN FILM (BBC vs. Rotten Tomatoes)
Top_Film_Decades_2

A quick look at the films by source and decade (above) shows a huge number of recent films in the RT listing (including one, Mad Max: Fury Road, that was still in theaters when I first downloaded the data). It is also interesting to note that the spike in “best” movies for Rotten Tomatoes occurs in the 1950s instead of the 1970s. However, the large number of foreign films in the RT list for the 1950s leads quickly to discussions of Japan’s “Golden Age” of cinema during that time period.

RANK COMPARISONS OF AMERICAN FILM (BBC vs. Rotten Tomatoes)
Rank_Comparisons_1

Another interesting view of this information can be seen when you compare the two ranked lists side-by-side. The chart above shows the 22 films that appear on both lists with a line connecting their two ranks. This makes it easy to see where the sources agree and where they disagree. Several of the critical darlings (Citizen Kane, The Godfather, Singin’ in the Rain, and North by Northwest) also rank high on the RT list while others (many of them from the American New Wave period of the 1970s) show a drop in popularity. Meanwhile, other classically popular films like The Wizard of Oz and ET: The Extra-Terrestrial float upward.

Anatomy of an Analysis (Part 1)

A few weeks ago, BBC News produced a list of the top 100 greatest American films based on input from critics from around the world.

Here are the top ten films presented in rank order:

  1. Citizen Kane (Orson Welles, 1941)
  2. The Godfather (Francis Ford Coppola, 1972)
  3. Vertigo (Alfred Hitchcock, 1958)
  4. 2001: A Space Odyssey (Stanley Kubrick, 1968)
  5. The Searchers (John Ford, 1956)
  6. Sunrise (FW Murnau, 1927)
  7. Singin’ in the Rain (Stanley Donen and Gene Kelly, 1952)
  8. Psycho (Alfred Hitchcock, 1960)
  9. Casablanca (Michael Curtiz, 1942)
  10. The Godfather Part II (Francis Ford Coppola, 1974)

There is really nothing too surprising here. Perennial favorite Citizen Kane tops the list followed by The Godfather and Vertigo — two of the most famous films (by two of the most famous directors) ever produced. Perusing the full list, you might recognize a few other titles and maybe think about adding some of them to your Netflix queue. But that’s about it. Aside from a handful of ancillary stories, there was little additional commentary to draw you deeper into the story. Sensing an opportunity, I decided to use this list to demonstrate the steps involved in a quick and simple analysis of data found “in the wild.”

Here follows a demonstration of my 5-step program for data analysis:

Source

The BBC asked each critic to submit a list of the ten films they felt were the greatest in American cinema (“… not necessarily the most important, just the best.”). For the project, an “American film” was defined as any movie that received funding from a U.S. source. This criterion included many films by foreign directors as well as films shot outside of the country. The highest-ranking film on each list received ten points, the next film down received nine points, and so on. The tenth pick received one point. All the points were then tallied to produce the final list.
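The scoring rule is a simple positional tally, and it can be sketched in a few lines. This is an illustration only: the function name is made up, and the sample ballots are invented and shortened to three films each (real ballots had ten).

```python
from collections import Counter


def tally_ballots(ballots):
    """Score ordered ballots: first pick gets 10 points, tenth pick gets 1."""
    scores = Counter()
    for ballot in ballots:
        for position, film in enumerate(ballot[:10]):
            scores[film] += 10 - position
    # Highest total first.
    return scores.most_common()


# Invented ballots, for illustration only.
ballots = [
    ["Citizen Kane", "Vertigo", "Psycho"],
    ["The Godfather", "Citizen Kane", "Vertigo"],
    ["Citizen Kane", "The Godfather", "Casablanca"],
]

for film, points in tally_ballots(ballots):
    print(f"{points:>3}  {film}")
```

The post does not say how ties were broken, so the sketch leaves tied films in whatever order `Counter.most_common` returns them.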

Processing

Even though the resulting “listicle” is fairly simple, it contains a lot of interesting information just waiting to be freed from its icy confines. I pulled the list into Excel and used some very basic string (text) manipulation to create four basic fields from each row of information:
List_String_Manipulation_1

Additional manipulation of the “Year” field yields a useful grouping category:
List_String_Manipulation_2

With the creation of these five fields, I now have a flexible database instead of a rigid list.
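The same string manipulation can be sketched outside of Excel. The example below assumes the four basic fields are Rank, Film, Director, and Year, with Decade as the derived grouping field; the exact column names in the spreadsheet are not shown, so treat them (and the function name) as guesses.

```python
import re

# Matches entries shaped like "3. Vertigo (Alfred Hitchcock, 1958)".
ENTRY = re.compile(
    r"^\s*(?P<rank>\d+)\.\s+(?P<film>.+?)\s+\((?P<director>.+),\s*(?P<year>\d{4})\)\s*$"
)


def parse_entry(line: str) -> dict:
    """Split one list entry into Rank, Film, Director, Year, and Decade."""
    match = ENTRY.match(line)
    if match is None:
        raise ValueError(f"unrecognized entry: {line!r}")
    year = int(match["year"])
    return {
        "Rank": int(match["rank"]),
        "Film": match["film"],
        "Director": match["director"],
        "Year": year,
        "Decade": f"{year // 10 * 10}s",  # 1968 becomes "1960s"
    }


row = parse_entry("4. 2001: A Space Odyssey (Stanley Kubrick, 1968)")
print(row)
```

The film pattern is deliberately lazy and the year is anchored to the end of the line, so titles that contain colons or digits (like this one) still parse.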

Organization

The final data set was “stored” as a table in a simple spreadsheet. Although I have many problems using Excel for data storage (more on that in a future post), it is a quick and easy way to organize small sets of data.

Transformation

Once the data was in the format I wanted, I created a pivot table that allowed me to manipulate information in different ways. I was particularly interested in answering questions like “Who are the top directors?”, “When were most of these films made?”, and “Was there ever a ‘Golden Age’ of modern cinema?” Most of these questions can be answered through simple grouping and summarization.
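For readers without a pivot table handy, the same grouping and summarization can be sketched with a counter. The rows below are just the ten films listed at the top of this post, so the counts will not match the full Top 100 tables; the variable names are made up for the example.

```python
from collections import Counter

# (film, director, year) for the ten films listed at the top of this post.
top_ten = [
    ("Citizen Kane", "Orson Welles", 1941),
    ("The Godfather", "Francis Ford Coppola", 1972),
    ("Vertigo", "Alfred Hitchcock", 1958),
    ("2001: A Space Odyssey", "Stanley Kubrick", 1968),
    ("The Searchers", "John Ford", 1956),
    ("Sunrise", "FW Murnau", 1927),
    ("Singin\u2019 in the Rain", "Stanley Donen and Gene Kelly", 1952),
    ("Psycho", "Alfred Hitchcock", 1960),
    ("Casablanca", "Michael Curtiz", 1942),
    ("The Godfather Part II", "Francis Ford Coppola", 1974),
]

# Group by one field and count the rows in each group.
films_per_director = Counter(director for _, director, _ in top_ten)
films_per_decade = Counter(f"{year // 10 * 10}s" for _, _, year in top_ten)

print(films_per_director.most_common(3))
print(films_per_decade.most_common())
```

Co-directed films are counted under the combined credit here; splitting them per director would be a separate processing decision.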

Serve

After all that work, it’s time to pull together the results and display them in some way. For this exercise, that means a few simple tables and charts:

TOP 10 DIRECTORS IN AMERICAN FILM

Director # of Films in the Top 100
Stanley Kubrick 5
Steven Spielberg 5
Alfred Hitchcock 5
Billy Wilder 5
Francis Ford Coppola 4
Howard Hawks 4
Martin Scorsese 4
John Ford 3
Orson Welles 3
Charlie Chaplin 3

TOP YEARS IN AMERICAN FILM

Year # of Films in the Top 100
1975 5
1980 4
1974 4
1959 4
1939 3
1941 3
1977 3
1946 3
1994 3

TOP DECADES IN AMERICAN FILM
Top_Film_Decades_1

These simple presentation tools start to tell some interesting stories and, like all good analysis tools, start to hint at additional avenues of exploration. For example, while two of the directors with five films in the Top 100 (Kubrick, Hitchcock) also made it into the Top 10, the other two (Spielberg and Wilder) did not … why? The year with the most films on the list was 1975 … what were the films? The 1970s account for over 20% of the films on the list … what was going on in the culture that led to this flowering of expression?

It would have been really great if the BBC article had included some sort of interactive tool that allowed readers to explore the database themselves. I will see what I can do to tackle this in an upcoming post.

Jon Stewart on Misinformation

I just finished watching Jon Stewart’s final episode of The Daily Show and I was glad to see that his parting speech addressed the topic of misinformation (aka: bullshit) and how to recognize it. The following is a rough transcript:

TRANSCRIPT: Jon Stewart delivers speech on “Bulls–t” during his final episode hosting “The Daily Show” Wednesday night on Comedy Central

Welcome back! Anyway, about the debate. I don’t have anything for you.

We’ve seen the correspondents. We’ve met everyone who works here. And now I feel like I should probably say something. So maybe one last time, maybe a little — if you want to — maybe a little camera three.

Bullshit is everywhere.

Are the kids still here? We’ll deal with that later.

Bullshit is everywhere. There is very little you will encounter in life that has not been, in some ways, infused with bullshit, not all of it bad. General day-to-day free-range bullshit is often necessary, or at least innocuous: “Oh, what a beautiful baby. I’m sure he’ll grow into that head.” That kind of bullshit in many ways provides important social-contract fertilizer and keeps people from making each other cry all day. But then there’s the more pernicious bullshit, your premeditated institutional bullshit designed to obscure and distract. Designed by whom? The bullshitacracy.

It comes in three basic flavors.

One, making bad things sound like good things. “Organic, all-natural cupcakes” … because factory-made sugar oatmeal balls doesn’t sell. “Patriot Act” … because “Are You Scared Enough to Let Me Look at All Your Phone Records Act” doesn’t sell. Whenever something is titled freedom, fairness, family, health, and America, take a good long sniff. Chances are it’s been manufactured in a facility that may contain traces of bullshit.

Number two, the second way, hiding the bad things under mountains of bullshit. Complexity: you know, I would love to download Drizzy’s latest Meek Mill diss. (Everyone promised me that that made sense.) But I’m not really interested right now in reading Tolstoy’s iTunes agreement, so I’ll just click “agree” even if it grants Apple prima nocte with my spouse. Here’s another one: simply put, banks shouldn’t be able to bet your pension money on red. Bullshitly put, it’s … hey, this. Dodd-Frank. Hey, a handful of billionaires can’t buy our elections, right? Of course not. They can only pour unlimited anonymous cash into a 501(c)4, otherwise they’d have to 501(c)6 it, or funnel it openly through a non-campaign coordinated Super PAC … “I think they’re asleep now. We can sneak out.”

And finally — finally, it’s the Bullshit of infinite possibility. These bullshitters cover their unwillingness to act under the guise of unending inquiry. We can’t do anything because we don’t yet know everything. We cannot take action on climate change until everyone in the world agrees gay marriage vaccines won’t cause our children to marry goats who are going to come for our guns. Until then, I say teach the controversy.

Now, the good news is this: bullshitters have gotten pretty lazy, and their work is easily detected. And looking for it is a pleasant way to pass the time, like an “I Spy” of bullshit. I say to you tonight, friends, the best defense against bullshit is vigilance. So if you smell something, say something.

Thanks for everything, Mr. Stewart. We couldn’t have made it through these last 16 years without you.

What Does the Average Biker Dude Look Like?

CNN recently published the mugshots of all of the biker gang members who were arrested after the recent shootout in a Waco, TX restaurant. There were a total of 171 pictures, all in the same standard pose and most in the same standard orange jumpsuit:

Mugshot_Matrix

Seeing all of these pictures got me wondering if it was possible to create an image that represented the typical gang member. I had seen a technique called “pixel averaging” applied to a series of wedding photos many years ago and I was able to find a tool called ImageJ that helped me with the processing.

It was a fairly straightforward effort … just upload the individual photos into an image stack and then apply a z-filter to each pixel of each “slice” or picture in the stack. The result is as follows:

Mugshot_Pixel_Avg
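For anyone curious about the arithmetic behind pixel averaging, here is a minimal sketch. It is not ImageJ and not the exact procedure used above: it assumes every picture has the same dimensions and is already loaded as a grid of grayscale values, and the function name and tiny sample "images" are invented for the example.

```python
def average_stack(stack):
    """Average same-position pixels across equally sized grayscale images."""
    if not stack:
        raise ValueError("need at least one image")
    height, width = len(stack[0]), len(stack[0][0])
    for image in stack:
        if len(image) != height or any(len(row) != width for row in image):
            raise ValueError("all images must share the same dimensions")
    count = len(stack)
    # For each (y, x) position, take the mean value over every slice.
    return [
        [sum(image[y][x] for image in stack) / count for x in range(width)]
        for y in range(height)
    ]


# Three invented 2x2 "photos" with brightness values from 0 to 255.
stack = [
    [[0, 30], [60, 90]],
    [[30, 60], [90, 120]],
    [[60, 90], [120, 150]],
]

composite = average_stack(stack)
print(composite)
```

Features that sit in the same place in most pictures (eyes, nose, shoulders) reinforce each other, while features that vary (beards, hair, glasses) blur out, which is why the composite looks so ordinary.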

Despite a wide variety of ages, facial hair, and ethnic backgrounds in the mugshots, the combined image looks like some guy you might see at a weekend softball game. It certainly gives no indication of how dangerous some of these men (and women) can be.

Snow Days and the Flow of Information

    Mood music for this post: Snow Day by Trip Shakespeare

    Friday was another snow day here in Wisconsin and I found it interesting to see how this piece of information wound its way through the various layers of the community. (For the uninitiated, a snow day is a weather-related closing of the local school system. They often involve heavy amounts of snowfall and a euphoric sense of good fortune on the part of students.)

    The classic, pre-Internet, pre-mobile method of disseminating this information to the public usually started with the school superintendent contacting area TV and radio stations on the day of the closing. The media would then run regular updates on their morning programs which would, in turn, be viewed or heard by parents getting ready for work. The kids themselves were at the bottom of this vertical flow of information, either hearing it directly from their parents or through their parents’ media choices.

    These days the school superintendent often contacts their IT director before the media because this person is hooked into a distributed network of communication technology that bypasses the traditional information hierarchy. They might set off a robocall that contacts families and employees and then post a message to social media sites with a much larger potential audience. Now kindergarteners following the district’s Twitter feed have the ability to hear about school closings at the same time as the local TV manager.

    This flattening of the information hierarchy was made readily apparent to me after dinner on Thursday. My son was in the other room doing his homework when the robocall came in announcing the closing. My wife — hoping to withhold this information until he finished — didn’t say anything at first. However, within seconds he had already received a text from a buddy with the good news. A victory for the little guy and a tremendous example of the democratization of information.

    Data Literacy 101: What is Data?

    Whenever the topic of data comes up at meetings or informal conversations, it doesn’t take long for people’s eyes to glaze over. The subject is usually considered so complex and esoteric that only a few technically minded geeks find value in the details. This easy dismissal of data is a real problem in the modern business world because so much of what we know about customers and products is codified as information and stored in corporate databases. Without a high level of data literacy this information sits idle and unused.

    One way I try to get people more interested in data is to make a distinction between data management and data content. In its broadest sense, data management consists of all the technical equipment, expertise, security procedures, and quality control measures that go into riding herd on large volumes of data. Data content, on the other hand, is all the fun stuff that is housed and made accessible by this infrastructure. To put it another way, think of data management as a twisty mountain road built by skilled engineers and laborers while data content is the Ferrari you get to drive on it.

    Okay, maybe that’s taking it a bit too far. Stick with me.

    At its most basic, data is simply something you want to remember (a concept I borrowed from an article by Rob Karel). Examples might include:

    • Your home address
    • Your mom’s birthday
    • Your computer password
    • A friend’s phone number
    • Your daughter’s favorite color

    You could simply memorize this information, of course, but human memory is fragile and so we often collect personally meaningful information and store it in “tools” like calendars, address books, spreadsheets, databases, or even paper lists. Although this last item might not seem like a robust data storage method it is a good introduction to some basic data concepts. (I’ve talked about the appeal of “Top 10” lists as a communication tool in a previous post but I didn’t really address their specific structure.)

    Let’s start with a simple grocery list:

    Data101_List_1

    Believe it or not, this is data. A list like this has a very loose data format consisting of related items separated by some sort of “delimiter” like a comma or — in this case — a new line on our fake note pad. You can add or subtract items from the list, count the total number of items, group items into categories (like “dairy” or “bakery”), or sort items by some sequence. Many of you will have created similar lists because they are great external memory aids.
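To make the "a list is data" idea concrete, here is a small sketch of the operations just mentioned. The items and the category lookup are invented for illustration and are not the contents of the pictured note pad.

```python
from collections import defaultdict

# A newline-delimited list, as it might be typed into a note.
raw_list = "eggs\nmilk\nbread\npeanut butter"

# The delimiter is what lets us split the text into separate items.
items = raw_list.split("\n")

items.append("butter")   # add an item
items.remove("eggs")     # subtract an item
total = len(items)       # count the items
alphabetical = sorted(items)  # sort by some sequence

# Group items into categories using a hypothetical lookup table.
CATEGORY = {
    "milk": "dairy",
    "butter": "dairy",
    "bread": "bakery",
    "peanut butter": "pantry",
}
grouped = defaultdict(list)
for item in items:
    grouped[CATEGORY.get(item, "other")].append(item)

print(total, alphabetical, dict(grouped))
```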

    The problem with this list is that it is very generalized. You could give this grocery list to ten different people and get ten different results. How many eggs do you want? Do you want whole milk, 2%, or fat free? What type of bread do you want? What brand of peanut butter do you like?

    This list really only works for you because a memory aid works in concert with your own personal circumstances. If someone doesn’t share that context then the content itself doesn’t translate very well. That’s okay for “to do” lists or solo trips to the grocery store but doesn’t work for a system that will be used by multiple people (like a business). In order to overcome this barrier you have to add specificity to your initial list.

    Data101_List_2

    This is a grocery list that I might hand over to my teenage son. It is more specific than the first list and has exact amounts and other additional details that he will need to get the order right. Notice, however, that there is a cost for this increased level of specificity, with the second list containing over four times as many characters as the first one. At the same time, this list still lacks key attributes that would help clarify the request for non-family members.

    If we are going to make this list more useful to others, we need to continue to improve its specificity while making it more versatile. One way to do this is to start thinking about how we would merge several grocery lists together.

    Data101_List_3

    Here is our original list stacked on top of a second list of similar items. I’ve added brand names to both of them and included a heading above each list with the name of the list’s owner. The data itself is still “unstructured”, however, meaning it is not organized in any particular way. This lack of structure doesn’t necessarily interfere with our goal of buying groceries but it does limit our ability to organize items or find meaningful patterns in the data. As our list grows this problem is compounded. Eventually, we’ll need to find some way of introducing structure to our lists.

    Data101_List_4

    One step we can take is to break up our list entries and put the individual pieces into a table. A table is an arrangement of rows and columns where each row represents a unique item, while each column or “field” contains elements of the same data “type.” For this first example, I’ve created three columns: a place for a “customer” name (the text of the list’s owner), an item count (a number), and the item itself (more text). Notice that the two lists are truly merged, allowing us to sort items if we want.
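A three-column table like this can be sketched as a list of typed rows. The customer names and items below are placeholders, not the entries in the pictured table, and the `Row` type is just one convenient way to give each column a name and a data type.

```python
from typing import NamedTuple


class Row(NamedTuple):
    customer: str  # text: the owner of the original list
    count: int     # number: how many to buy
    item: str      # text: the item, still with size details embedded


# Two hypothetical lists, already broken into fields.
list_a = [
    Row("Customer A", 1, "gallon 2% milk"),
    Row("Customer A", 1, "loaf wheat bread"),
]
list_b = [
    Row("Customer B", 2, "quart skim milk"),
    Row("Customer B", 1, "dozen eggs"),
]

# Because both lists share the same columns, merging is just concatenation.
table = list_a + list_b

# Any column can now drive a sort.
by_item = sorted(table, key=lambda row: row.item)
for row in by_item:
    print(f"{row.customer:<12}{row.count:>3}  {row.item}")
```

In this made-up data the two milk entries do not end up next to each other, because the size text at the front of the item field controls the sort order.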

    Data101_List_4_sorted

    Sorting makes it a bit easier to pick out similar items, which will help a little on our fictitious shopping trip. However, we still have a problem. Some of the items (like the milk, butter, and peanut butter) are sorted by the size criteria listed in the unstructured text, which makes it harder to see that some of the things can be found in the same aisle. Adding new fields will help with this.

    Data101_List_5_sorted

    By adding separate columns for brand name and size, the data in the “item” column is actually pretty close to our first list. All the additional details are included in new fields that are clearly defined and contain similar data. We’ve had to clean up a few labeling issues (such as “skim milk” vs. “fat free milk”) but these are relatively minor data governance issues. Our final, summarized list is ready for prime time.

    Data101_List_6_Summary
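The cleanup and summary steps can be sketched as follows. Everything specific here is an assumption: the rows, brands, and sizes are placeholders, the choice of "skim milk" as the standard label is arbitrary, and the summary simply totals the count per item, which may not match the pictured table exactly.

```python
from collections import Counter

# Labels that mean the same thing are mapped to one standard label.
ITEM_SYNONYMS = {"fat free milk": "skim milk"}

# Hypothetical rows with brand and size split into their own fields.
rows = [
    {"customer": "Customer A", "count": 1, "item": "skim milk", "brand": "Brand X", "size": "1 gal"},
    {"customer": "Customer B", "count": 2, "item": "fat free milk", "brand": "Brand Y", "size": "1 qt"},
    {"customer": "Customer A", "count": 1, "item": "peanut butter", "brand": "Brand Z", "size": "16 oz"},
    {"customer": "Customer B", "count": 1, "item": "peanut butter", "brand": "Brand Z", "size": "28 oz"},
]


def summarize(rows):
    """Total the count for each item after standardizing its label."""
    totals = Counter()
    for row in rows:
        item = ITEM_SYNONYMS.get(row["item"], row["item"])
        totals[item] += row["count"]
    # Alphabetical order keeps the summary easy to scan.
    return dict(sorted(totals.items()))


summary = summarize(rows)
print(summary)
```

Without the synonym table, the two milk labels would show up as separate lines in the summary, which is exactly the kind of small labeling issue that has to be settled before the numbers can be trusted.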

    And that, my friend, is how data is made.

    A 12-Year-Old’s Take on Data and Analysis

    I asked my daughter what she thought data was last night. Her text to me at 11:45 pm:

    Okay, so you asked me what data was at the motorcycle thing today and at first, I didn’t really think about that much I just said an answer if what I thought it was. I was rethinking that and the answer I have didn’t sound right now. So, I do what most people would do and that was look it up in the dictionary, what the definition was was this:

    Data_definition_Anna

    And it was really boring. Even though it was true it sounded like the color gray or beige and we both know that those are the boring colors. So, if you now start to think of data as a color or colors, the colors I come up with are the primary colors and the three colors that those primary colors make. Data is colors at this point, and what you’re trying to achieve is a painting. For an example, if you asked 5 different people in 5 different stages of their life what their favorite cereal was, the answers those people would give you are the pastels or paints that they have given you and you are the artist at this point. You have to create the painting. Now, I would infer that the people in the later stages of their life they might say the boring cereals made of cardboard and unhappiness, while people in early stages of their life would most likely say one of the sugary cereals. The picture you would make would say to the people looking at it that old people are boring and kids have better breakfast cereals. This may have been a reasonable explanation or it might’ve been crap but what the fluff. It’s almost 11:45 so don’t blame me. P.S. Tell the people to get a tumblr :)

    I get it now!