What Does the Average Biker Dude Look Like?

CNN recently published the mugshots of all of the biker gang members arrested after the shootout at a Waco, TX restaurant. Seeing all 171 of these pictures got me wondering whether it was possible to create an image that represented the typical gang member. I had seen a technique called “pixel averaging” applied to a series of wedding photos many years ago, and I was able to find a tool called ImageJ to help with the processing.

It was a fairly straightforward effort … just load the individual photos into an image stack and then apply a z-axis filter that averages each pixel position across all of the “slices,” or pictures, in the stack. The result is as follows:

Mugshot_Pixel_Avg

Despite a wide variety of ages, facial hair, and ethnic backgrounds in the mugshots, the combined image looks like some guy you might see at a weekend softball game. It certainly gives no indication of how dangerous these men can be.
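
If you would rather script the averaging step than use ImageJ, here is a minimal Python sketch of the same idea (an average-intensity projection along the stack). It assumes a folder of equally sized, roughly aligned photos; the folder name and file pattern are placeholders, not the actual source files.

    # Average a stack of equally sized photos pixel-by-pixel, the same idea as
    # an average-intensity z-projection in ImageJ. Paths are placeholders.
    from glob import glob

    import numpy as np
    from PIL import Image

    paths = sorted(glob("mugshots/*.jpg"))  # hypothetical folder of same-sized photos

    # Load each photo as a float array so the averaging doesn't clip 8-bit values
    stack = np.stack([np.asarray(Image.open(p).convert("RGB"), dtype=np.float64)
                      for p in paths])

    # Average along the "z" axis (across photos) and convert back to an image
    average = Image.fromarray(stack.mean(axis=0).astype(np.uint8))
    average.save("average_mugshot.png")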

Snow Days and the Flow of Information

Mood music for this post: Snow Day by Trip Shakespeare

Friday was another snow day here in Wisconsin, and I found it interesting to see how this piece of information wound its way through the various layers of the community. (For the uninitiated, a snow day is a weather-related closing of the local school system. It often involves heavy snowfall and a euphoric sense of good fortune on the part of students.)

The classic, pre-Internet, pre-mobile method of disseminating this information to the public usually started with the school superintendent contacting area TV and radio stations on the day of the closing. The media would then run regular updates on their morning programs which would, in turn, be viewed or heard by parents getting ready for work. The kids themselves were at the bottom of this vertical flow of information, either hearing it directly from their parents or through their parents’ media choices.

These days the school superintendent often contacts their IT director before the media because this person is hooked into a distributed network of communication technology that bypasses the traditional information hierarchy. They might set off a robocall that contacts families and employees and then post a message to social media sites with a much larger potential audience. Now kindergarteners following the district’s Twitter feed can hear about school closings at the same time as the local TV manager.

This flattening of the information hierarchy was made readily apparent to me after dinner on Thursday. My son was in the other room doing his homework when the robocall came in announcing the closing. My wife — hoping to withhold this information until he finished — didn’t say anything at first. However, within seconds he had already received a text from a buddy with the good news. A victory for the little guy and a tremendous example of the democratization of information.

Data Literacy 101: What is Data?

Whenever the topic of data comes up in meetings or informal conversations, it doesn’t take long for people’s eyes to glaze over. The subject is usually considered so complex and esoteric that only a few technically minded geeks find value in the details. This easy dismissal of data is a real problem in the modern business world because so much of what we know about customers and products is codified as information and stored in corporate databases. Without a high level of data literacy, this information sits idle and unused.

One way I try to get people more interested in data is to make a distinction between data management and data content. In its broadest sense, data management consists of all the technical equipment, expertise, security procedures, and quality control measures that go into riding herd on large volumes of data. Data content, on the other hand, is all the fun stuff that is housed and made accessible by this infrastructure. To put it another way, think of data management as a twisty, mountain road built by skilled engineers and laborers while data content is the Ferrari you get to drive on it.

Okay, maybe that’s taking it a bit too far. Stick with me.

At its most basic, data is simply something you want to remember (a concept I borrowed from an article by Rob Karel). Examples might include:

  • Your home address
  • Your mom’s birthday
  • Your computer password
  • A friend’s phone number
  • Your daughter’s favorite color

You could simply memorize this information, of course, but human memory is fragile, so we often collect personally meaningful information and store it in “tools” like calendars, address books, spreadsheets, databases, or even paper lists. Although this last item might not seem like a robust data storage method, it is a good introduction to some basic data concepts. (I’ve talked about the appeal of “Top 10” lists as a communication tool in a previous post, but I didn’t really address their specific structure.)

Let’s start with a simple grocery list:

Data101_List_1

Believe it or not, this is data. A list like this has a very loose data format consisting of related items separated by some sort of “delimiter” like a comma or — in this case — a new line on our fake note pad. You can add or subtract items from the list, count the total number of items, group items into categories (like “dairy” or “bakery”), or sort items by some sequence. Many of you will have created similar lists because they are great external memory aids.
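
To make this concrete, here is a minimal Python sketch of those same operations. The items are my guess at the pictured list rather than its exact contents, and the category mapping is made up for illustration.

    # A grocery list is just newline-delimited text. Once it is split on the
    # delimiter, you can count, sort, and group the items.
    grocery_list = "eggs\nmilk\nbread\nbutter\npeanut butter"  # approximate items

    items = grocery_list.split("\n")   # the newline is our delimiter
    print(len(items))                  # count the items -> 5
    print(sorted(items))               # sort them alphabetically

    # Group items into rough categories (an illustrative mapping)
    categories = {"eggs": "dairy", "milk": "dairy", "butter": "dairy",
                  "bread": "bakery", "peanut butter": "pantry"}
    for item in items:
        print(categories[item], "-", item)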

The problem with this list is that it is very generalized. You could give this grocery list to ten different people and get ten different results. How many eggs do you want? Do you want whole milk, 2%, or fat free? What type of bread do you want? What brand of peanut butter do you like?

This list really only works for you because a memory aid works in concert with your own personal circumstances. If someone doesn’t share that context then the content itself doesn’t translate very well. That’s okay for “to do” lists or solo trips to the grocery store but doesn’t work for a system that will be used by multiple people (like a business). In order to overcome this barrier you have to add specificity to your initial list.

Data101_List_2

This is a grocery list that I might hand over to my teenage son. It is more specific than the first list and has exact amounts and other additional details that he will need to get the order right. Notice, however, that there is a cost for this increased level of specificity, with the second list containing over four times as many characters as the first one. At the same time, this list still lacks key attributes that would help clarify the request for non-family members.

If we are going to make this list more useful to others, we need to continue to improve its specificity while making it more versatile. One way to do this is to start thinking about how we would merge several grocery lists together.

Data101_List_3

Here is our original list stacked on top of a second list of similar items. I’ve added brand names to both of them and included a heading above each list with the name of the list’s owner. The data itself is still “unstructured”, however, meaning it is not organized in any particular way. This lack of structure doesn’t necessarily interfere with our goal of buying groceries but it does limit our ability to organize items or find meaningful patterns in the data. As our list grows this problem is compounded. Eventually, we’ll need to find some way of introducing structure to our lists.

Data101_List_4

One step we can take is to break up our list entries and put the individual pieces into a table. A table is an arrangement of rows and columns where each row represents a unique item, while each column or “field” contains elements of the same data “type.” For this first example, I’ve created three columns: a “customer” name (text identifying the list’s owner), an item count (a number), and the item itself (more text). Notice that the two lists are truly merged, allowing us to sort items if we want.

Data101_List_4_sorted

Sorting makes it a bit easier to pick out similar items, which will help a little on our fictitious shopping trip. However, we still have a problem. Some of the items (like the milk, butter, and peanut butter) sort on the size information embedded in the unstructured text, which makes it harder to see that some of these things can be found in the same aisle. Adding new fields will help with this.

Data101_List_5_sorted

By adding separate columns for brand name and size, the data in the “item” column is actually pretty close to our first list. All of the additional details are included in new fields that are clearly defined and contain similar data. We’ve had to clean up a few labeling issues (such as “skim milk” vs. “fat free milk”) but these are relatively minor data governance issues. Our final, summarized list is ready for prime time.

Data101_List_6_Summary

And that, my friend, is how data is made.
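
If you want to tinker with the idea yourself, here is a minimal Python sketch of that final table. The customers, counts, brands, and sizes are illustrative guesses, not the actual pictured data.

    # Each row is a unique item; each column ("field") holds one kind of data.
    from collections import defaultdict

    rows = [
        {"customer": "Dad", "count": 1, "item": "milk",
         "brand": "Kemps", "size": "1 gallon"},
        {"customer": "Son", "count": 1, "item": "milk",
         "brand": "Kemps", "size": "1 gallon"},
        {"customer": "Dad", "count": 2, "item": "bread",
         "brand": "Brownberry", "size": "24 oz"},
        {"customer": "Son", "count": 1, "item": "peanut butter",
         "brand": "Jif", "size": "16 oz"},
    ]

    # Sorting on the "item" field groups like items together...
    for row in sorted(rows, key=lambda r: r["item"]):
        print(row["item"], row["brand"], row["size"], row["customer"])

    # ...and the summary list is just a matter of adding up counts per item.
    totals = defaultdict(int)
    for row in rows:
        totals[(row["item"], row["brand"], row["size"])] += row["count"]
    for key, count in totals.items():
        print(count, *key)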

A 12-Year-Old’s Take on Data and Analysis

Last night I asked my daughter what she thought data was. Her text to me at 11:45 pm:

Okay, so you asked me what data was at the motorcycle thing today and at first, I didn’t really think about that much I just said an answer if what I thought it was. I was rethinking that and the answer I have didn’t sound right now. So, I do what most people would do and that was look it up in the dictionary, what the definition was was this:

Data_definition_Anna

And it was really boring. Even though it was true it sounded like the color gray or beige and we both know that those are the boring colors. So, if you now start to think of data as a color or colors, the colors I come up with are the primary colors and the three colors that those primary colors make. Data is colors at this point, and what you’re trying to achieve is a painting. For an example, if you asked 5 different people in 5 different stages of their life what their favorite cereal was, the answers those people would give you are the pastels or paints that they have given you and you are the artist at this point. You have to create the painting. Now, I would infer that the people in the later stages of their life they might say the boring cereals made of cardboard and unhappiness, while people in early stages of their life would most likely say one of the sugary cereals. The picture you would make would say to the people looking at it that old people are boring and kids have better breakfast cereals. This may have been a reasonable explanation or it might’ve been crap but what the fluff. It’s almost 11:45 so don’t blame me. P.S. Tell the people to get a tumblr :)

I get it now!

Ideas Illustrated LLC Celebrates 10 Years

I was filling out the annual report forms for Ideas Illustrated LLC a few weeks ago and noticed that my original filing date was May 11, 2004… making today my 10th anniversary! It’s hard to believe that a full decade has passed since my wife and I sat around brainstorming ideas for a company. It’s been a fun ride so far, with several great side projects, a well-regarded blog, and a lot of new challenges. It hasn’t made me a millionaire but it has put some extra cash in my pocket and probably saved my sanity on more than one occasion. Here’s to ten more years!

iiLOGO_white_10th_anniversa

A Force Node Diagram of the U.S. Interstate System

There’s nothing too complicated about this post. I’ve been interested in creating an illustration of the U.S. Interstate system for a while but my initial concept of a “subway-style” diagram of the network had already been done. After some recent experimentation with the D3 Javascript library, I decided that it might be interesting to try out a simple force node display using the Interstate system’s control cities as the nodes. Control cities are certain major destinations that are used to provide navigational guidance at key decision points along a particular route. It should be noted that not all control cities are actually cities and not all cities qualify as control cities. My starting list can be found here.

After my initial data collection, I found that I had to modify my approach to improve the network. First of all, I had to add some nodes for certain highway-to-highway connections, especially those that occurred in remote areas. I also had to include some cities with multiple Interstate highways passing through them because they weren’t always listed on each route. Finally, I added a few non-Interstate roads where I thought it made sense, including Alaska (which doesn’t actually have any Interstate highways) and eastern Canada, which has a major highway called the King’s Highway or Ontario Highway 401 linking Toronto and Montreal to key American cities.

Here is the result … click on the picture to get to a fully interactive version.

interstate_force_node_v2

The size of the nodes is related to the estimated population of the city/destination and the color represents Census division (plus Canada). You can kind of see a rough outline of the U.S., with the Midwest roughly in the center of the diagram (in orange) and the two coasts wrapping around on either side. Hawaii and Alaska float alone at the edge and the Florida peninsula (in the South Atlantic division, in red) protrudes out toward the bottom of the chart.
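
The interactive version is built with D3’s force layout. As a rough stand-in, here is a minimal Python sketch of the same idea using networkx’s spring (force-directed) layout; the handful of cities, links, and populations below are illustrative and not my actual dataset.

    # Force-directed layout of a tiny, made-up slice of the network:
    # control cities as nodes, Interstate segments as edges,
    # node size scaled by approximate population.
    import matplotlib.pyplot as plt
    import networkx as nx

    G = nx.Graph()
    G.add_edges_from([
        ("Chicago", "Milwaukee"), ("Milwaukee", "Madison"),            # I-94
        ("Madison", "Minneapolis"),                                    # I-94
        ("Chicago", "Indianapolis"),                                   # I-65
        ("Indianapolis", "St. Louis"), ("St. Louis", "Kansas City"),   # I-70
    ])

    population = {"Chicago": 2_700_000, "Milwaukee": 600_000, "Madison": 250_000,
                  "Minneapolis": 400_000, "Indianapolis": 850_000,
                  "St. Louis": 300_000, "Kansas City": 470_000}

    pos = nx.spring_layout(G, seed=42)                 # force-directed positions
    sizes = [population[n] / 5_000 for n in G.nodes]   # node size ~ population
    nx.draw(G, pos, with_labels=True, node_size=sizes, node_color="orange")
    plt.show()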

Who’s Your Filter? (Nate Silver Edition)

“There are two ways to be fooled. One is to believe what isn’t true; the other is to refuse to believe what is true.” ~ Søren Kierkegaard

Back in 2010, I wrote a short post about some of the problems associated with getting all of your news and information from biased sources. It was essentially a call for people to hone their critical thinking skills and take steps toward establishing a more reality-based approach to decision-making.

Unfortunately, people don’t like challenging their existing beliefs very much because it can be pretty uncomfortable. They prefer sources of information that support their established worldviews and generally ignore or filter out those that don’t. In our modern society, this confirmation bias supports an entire ecosystem of publishers, news outlets, TV shows, bloggers, and radio announcers designed to serve up pre-filtered opinion disguised as fact.

For many people, the glossy veneer of the news entertainment complex is all they want or need. As David McRaney so succinctly states in his blog:

Whether or not pundits are telling the truth, or vetting their opinions, or thoroughly researching their topics is all beside the point. You watch them not for information, but for confirmation.

The problem with this approach is that — every now and then — fantasy runs into cold, hard reality and gets the sh*t kicked out of it.

This was what happened during the 2012 Presidential election cycle. Talking heads on both ends of the political spectrum had spent months trying to sway their audiences with confident declarations of victory and vicious denials of opposing statements. By the week of the election, the conservative media in particular had created such a self-reinforcing bubble of polls and opinions that any hints of trouble were shouted down and ignored. Pundits reserved particularly strong venom for statistician Nate Silver, whose FiveThirtyEight blog in the New York Times had upped the chances of an Obama win to a seemingly outrageous 91.4% the Monday before the election.

The furor reached its peak with Karl Rove’s famous on-air exchange with Fox News anchor Megyn Kelly, and it rippled through the conservative echo chamber after the polls closed. There was a lot of soul searching over the next few days, with many people taking direct aim at the conservative media for its failure to present accurate information to its audience. This frustration was summed up clearly by one commenter on RedState, a right-leaning blog:

“I can accept that my news is not really ‘news’ like news in Cronkite’s day, but a conservative take on the news. But it’s unacceptable that Rasmussen appears to have distinguished themselves from everyone else in their quest to shade the numbers to appease us, the base. I didn’t even look at other polls, to tell the truth, trusting that their methodology was more sound because it jived with what I was hearing on Fox and with people I talked to. It pains me to say this, but next time I want a dose of hard truth, I’m looking to Nate Silver, even if I don’t like the results.”

It was a teachable moment and Nate Silver — no fan of pundits — suggested that the fatal flaw in the approach taken by most of these political “experts” was that they based their forecasts less on evidence and more on a strong underlying ideology. Their core beliefs — “ideological priors” as Silver calls them — colored their views on everything and made it difficult to read such an uncertain situation correctly. It was time for something new.

In his book, The Signal and the Noise, Silver elaborates on the work of Philip Tetlock, who found that people with certain character traits typically made more accurate predictions than those without these traits. Tetlock identified these two different cognitive styles as either “fox” (someone who considers many approaches to a problem) or “hedgehog” (someone who believes in one Big Idea). There has been much debate about which one represents the best approach to forecasting but Tetlock’s research clearly favors the fox.

Tetlock’s ideas as summarized by Silver:

Fox-Like Characteristics (Better Forecasters)

  • Multidisciplinary – Incorporates ideas from a range of disciplines
  • Adaptable – Try several approaches in parallel, or find a new one if things aren’t working
  • Self-critical – Willing to accept mistakes and adapt or even replace a model based on new data
  • Tolerant of complexity – Accept the world is complex, and that certain things cannot be reduced to a null hypothesis
  • Cautious – Predictions are probabilistic, and qualified
  • Empirical – Observable data is always preferred over theory or anecdote

Hedgehog-Like Characteristics (Weaker Forecasters)

  • Specialised – Often dedicated themselves to one or two big problems & are sceptical of outsiders
  • Unshakable – New data is used to refine an original model
  • Stubborn – Mistakes are blamed on poor luck
  • Order seeking – Once patterns are detected, assume relationships are relatively uniform
  • Confident – Rarely change or hedge their position
  • Ideological – Approach to predictive problems fits within a similar view of the wider world

Nate Silver also prefers the fox-like approach to analysis and even chose a fox logo for the relaunch of his FiveThirtyEight blog. As befits a fox’s multidisciplinary approach to problems, his manifesto for the site involves blending good old-fashioned journalism skills with statistical analysis, computer programming, and data visualization. (It is essentially a combination of everything we’ve been saying about data science + data-literate reporting.)

Nate Silver’s Four-Step Methodology for Data Journalism
This approach is very similar to the standard data science process.

  1. Data Collection – Performing interviews, research, first-person observation, polls, experiments, or data scraping.
  2. Organization – Developing a storyline, running descriptive statistics, placing data in a relational database, or building a data visualization.
  3. Explanation – Performing traditional analysis or running statistical tests to look for relationships in the data.
  4. Generalization – Verifying hypotheses through predictions or repeated experiments.

Like data science, data journalism involves finding meaningful insights in a vast sea of information. And like data science, one of the biggest challenges to data-driven journalism is convincing people to actually listen to what the data is telling them. Since FiveThirtyEight posted its prediction of a possible change in control of the Senate in 2014, Democrats have reacted with the same bluster Republicans showed back in 2012. At about the same time, economist Paul Krugman started a feud with Silver over what are, in my view, relatively minor journalistic differences. Meanwhile, conservatives, gleeful at this apparent Leftie infighting, continue to predict Silver’s ultimate failure because they still believe that politics is more art than science.

This seems to be a fundamental misunderstanding of what Silver and others like him are trying to do. Rather than look at how successful Silver’s forecasting methodology has been at predicting political results, most people seem to be treating him as just another pundit who has joined the political game. Lost in all of the fuss is his attempt to bring a little more scientific rigor to an arena that is dominated by people who generally operate on intuition and gut instinct. I’m certainly not trying to elevate statisticians and data journalists to god-like status here but it is my hope that people will start to recognize the value of unbiased evaluation and include it as one of their tools for gathering information. When it’s fantasy vs. reality, it is always better to be armed with the facts.

Update: