Tag Archives: Information Design

Data Literacy 101: What is Data?

Whenever the topic of data comes up at meetings or informal conversations it doesn’t take long for people’s eyes to glaze over. The subject is usually considered so complex and esoteric that only a few technically-minded geeks find value in the details. This easy dismissal of data is a real problem in the modern business world because so much of what we know about customers and products is codified as information and stored in corporate databases. Without a high level of data literacy this information sits idle and unused.

One way I try to get people more interested in data is to make a distinction between data management and data content. In its broadest sense, data management consists of all the technical equipment, expertise, security procedures, and quality control measures that go into riding herd on large volumes of data. Data content, on the other hand, is all the fun stuff that is housed and made accessible by this infrastructure. To put it another way, think of data management as a twisty, mountain road built by skilled engineers and laborers while data content is the Ferrari you get to drive on it.

Okay, maybe that’s taking it a bit too far. Stick with me.

At its most basic, data is simply something you want to remember (a concept I borrowed from an article by Rob Karel). Examples might include:

  • Your home address
  • Your mom’s birthday
  • Your computer password
  • A friend’s phone number
  • Your daughter’s favorite color

You could simply memorize this information, of course, but human memory is fragile and so we often collect personally meaningful information and store it in “tools” like calendars, address books, spreadsheets, databases, or even paper lists. Although this last item might not seem like a robust data storage method it is a good introduction to some basic data concepts. (I’ve talked about the appeal of “Top 10” lists as a communication tool in a previous post but I didn’t really address their specific structure.)

Let’s start with a simple grocery list:

Data101_List_1

Believe it or not, this is data. A list like this has a very loose data format consisting of related items separated by some sort of “delimiter” like a comma or — in this case — a new line on our fake note pad. You can add or subtract items from the list, count the total number of items, group items into categories (like “dairy” or “bakery”), or sort items by some sequence. Many of you will have created similar lists because they are great external memory aids.
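For readers who like to see ideas in code, here is a minimal Python sketch of the same operations. The item names are borrowed from the questions later in this post, and the category mapping is made up for illustration, so treat both as assumptions rather than the exact contents of the note pad above.

```python
# A grocery list stored as plain text, with a new line as the delimiter.
raw_list = "eggs\nmilk\nbread\npeanut butter"

# Splitting on the delimiter turns the text into individual items.
items = raw_list.split("\n")

# Add and subtract items.
items.append("butter")
items.remove("eggs")

# Count the total number of items.
item_count = len(items)

# Group items into categories (hypothetical mapping, for illustration only).
categories = {"milk": "dairy", "butter": "dairy", "bread": "bakery"}
grouped = {}
for item in items:
    grouped.setdefault(categories.get(item, "other"), []).append(item)

# Sort items by some sequence (here, alphabetically).
sorted_items = sorted(items)

print(item_count)
print(grouped)
print(sorted_items)
```

Even this loose format supports every operation mentioned above, which is why a simple list counts as data.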

The problem with this list is that it is very generalized. You could give this grocery list to ten different people and get ten different results. How many eggs do you want? Do you want whole milk, 2%, or fat free? What type of bread do you want? What brand of peanut butter do you like?

This list really only works for you because a memory aid works in concert with your own personal circumstances. If someone doesn’t share that context then the content itself doesn’t translate very well. That’s okay for “to do” lists or solo trips to the grocery store but doesn’t work for a system that will be used by multiple people (like a business). In order to overcome this barrier you have to add specificity to your initial list.

Data101_List_2

This is a grocery list that I might hand over to my teenage son. It is more specific than the first list and has exact amounts and other additional details that he will need to get the order right. Notice, however, that there is a cost for this increased level of specificity, with the second list containing over four times as many characters as the first one. At the same time, this list still lacks key attributes that would help clarify the request for non-family members.

If we are going to make this list more useful to others, we need to continue to improve its specificity while making it more versatile. One way to do this is to start thinking about how we would merge several grocery lists together.

Data101_List_3

Here is our original list stacked on top of a second list of similar items. I’ve added brand names to both of them and included a heading above each list with the name of the list’s owner. The data itself is still “unstructured”, however, meaning it is not organized in any particular way. This lack of structure doesn’t necessarily interfere with our goal of buying groceries but it does limit our ability to organize items or find meaningful patterns in the data. As our list grows this problem is compounded. Eventually, we’ll need to find some way of introducing structure to our lists.

Data101_List_4

One step we can take is to break up our list entries and put the individual pieces into a table. A table is an arrangement of rows and columns where each row represents a unique item, while each column or “field” contains elements of the same data “type.” For this first example, I’ve created three columns: a place for a “customer” name (the text of the list’s owner), an item count (a number), and the item itself (more text). Notice that the two lists are truly merged, allowing us to sort items if we want.

Data101_List_4_sorted

Sorting makes it a bit easier to pick out similar items, which will help a little on our fictitious shopping trip. However, we still have a problem. Some of the items (like the milk, butter, and peanut butter) are sorted by the size criteria listed in the unstructured text, which makes it harder to see that some of these things can be found in the same aisle. Adding new fields will help with this.
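The sorting problem is easy to reproduce. The sketch below builds a three-column table in Python and sorts it by the item text. The customer names and item descriptions are hypothetical stand-ins, not the actual rows from my lists.

```python
from typing import NamedTuple


class Row(NamedTuple):
    """One row of the table; each field holds a single data type."""
    customer: str  # text: the list's owner
    count: int     # number: how many to buy
    item: str      # text: still unstructured, size and all


# Hypothetical rows merged from two lists.
rows = [
    Row("Pat", 1, "gallon 2% milk"),
    Row("Pat", 1, "loaf wheat bread"),
    Row("Sam", 1, "quart skim milk"),
    Row("Sam", 1, "16 oz peanut butter"),
    Row("Pat", 1, "dozen eggs"),
]

# Sort the merged table by the item column.
sorted_rows = sorted(rows, key=lambda row: row.item)
sorted_items = [row.item for row in sorted_rows]

# The two milk entries are split up because the size comes first in the text.
milk_positions = [i for i, item in enumerate(sorted_items) if "milk" in item]

print(sorted_items)
print(milk_positions)
```

Because the sort key is the whole text string, "gallon" and "quart" decide where the milk lands, not the word "milk" itself.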

Data101_List_5_sorted

By adding separate columns for brand name and size, the data in the “item” column is actually pretty close to our first list. All the additional details are included in new fields that are clearly defined and contain similar data. We’ve had to clean up a few labeling issues (such as “skim milk” vs. “fat free milk”) but these are relatively minor data governance issues. Our final, summarized list is ready for prime time.
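Here is a rough Python sketch of those last steps: separate fields, a small label cleanup, a sort, and a summary. The rows, brand placeholders, and the choice of “skim milk” as the standard label are all assumptions for illustration.

```python
from collections import Counter

# Hypothetical label cleanup: map variant labels to one standard label.
LABEL_FIXES = {"fat free milk": "skim milk"}

# Hypothetical rows with brand and size split into their own fields.
rows = [
    {"customer": "Pat", "count": 1, "item": "skim milk", "brand": "Brand A", "size": "1 gal"},
    {"customer": "Sam", "count": 2, "item": "fat free milk", "brand": "Brand B", "size": "1 qt"},
    {"customer": "Pat", "count": 1, "item": "peanut butter", "brand": "Brand C", "size": "16 oz"},
    {"customer": "Sam", "count": 1, "item": "peanut butter", "brand": "Brand C", "size": "28 oz"},
    {"customer": "Pat", "count": 1, "item": "bread", "brand": "Brand D", "size": "1 loaf"},
]

# Apply the label fixes without touching the other fields.
cleaned = [{**row, "item": LABEL_FIXES.get(row["item"], row["item"])} for row in rows]

# Sorting by item now keeps similar products together.
by_item = sorted(cleaned, key=lambda row: (row["item"], row["brand"]))

# Summarize: total quantity requested per item across all customers.
summary = Counter()
for row in cleaned:
    summary[row["item"]] += row["count"]

print([row["item"] for row in by_item])
print(dict(summary))
```

With size and brand out of the way, the item column sorts cleanly and the summary is a simple count per item.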

Data101_List_6_Summary

And that, my friend, is how data is made.

Infographics and Data Visualization (Week 5/6)

I took part in a brief discussion on the student forum after the Week 4 project and it made me realize that I’d been spending so much time trying to create a functional interactive graphic in Tableau that I was missing out on practicing some of the basic techniques of the class. When you combine that with the fact that my favorite attempt was a sketch I laid out in PowerPoint, I decided that I should try to focus on the structure and design of the graphic to see what I could come up with.

The topic I picked was based on some data that I’d pulled back in May/June that I’d never had a chance to use. This data covered all of the various U.S. breweries and the variety of beers they made. I did some additional research to add some information on beer ingredients (especially water, barley and hops) as well as some interesting stats on beer consumption based on a few fun maps done at FloatingSheep.

I spent a good deal of time coming up with the basic grid of the graphic, which ended up having a static left hand column for the introduction to each topic and then an interactive map of the U.S. on the right. The interactive portion consists of tabbed sections that allow you to navigate through several subtopics.

The flow of the series starts with an overview of beer production in the U.S., moves to a section on the ingredients of beer, and ends with information on American beer consumption. (I also thought about including some local beer stats for the great State of Wisconsin but that may have to wait.)

Due to time constraints, these mockups contain sample maps from other sources (here and here):

Infographics and Data Visualization (Week 4)

The assignment for Week 4 is based on data used in a recent Guardian article on U.S. unemployment. Having used Bureau of Labor Statistics (BLS) data for many years at my previous job, I am far more familiar with this topic than I was with the data we used for last week’s assignment. In fact, I have already written several blog posts dealing with general employment statistics so it will be challenging to come up with something fresh.

The Guardian article includes an interactive map that highlights the lower 48 states (Hawaii and Alaska are off screen) and allows the user to select one of eight different employment metrics. A five-color scale defines the range of each metric while clicking on an individual state brings up a bar chart displaying a few data points and some additional text.

One problem I have with this map is that I think the states are too large to tell a detailed story about how unemployment affects different areas of the country. Maps at the county level (like this one from the BLS or this gorgeous D3 example posted on GitHub) show far more interesting regional employment patterns and help create a more compelling story. (Alberto Cairo talks about the importance of enumeration unit size in this week’s reading assignment.)

Another criticism is that the map only uses a fraction of the employment/unemployment information available from the BLS. This data is relatively easy to download and so there’s no real reason not to include a richer dataset in the graphic. Additional data would allow more detailed monthly trends and more meaningful comparisons to the national rate and/or the rates of other states.

Finally, I think the color scheme used on this map is hard to interpret. The color categories are not easily distinguished from one another and they don’t relate to any natural scale that the user could use to detect patterns. Creating more categories might also help with interpretation of the data.
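To make the idea of categories concrete, here is a small Python sketch of one common way to assign values to color classes: equal-interval breaks. I don’t know how the Guardian map defines its five classes, so this is only an illustration, and the rates below are made-up numbers rather than BLS data.

```python
import bisect


def equal_interval_breaks(values, n_classes):
    """Return the n_classes - 1 cut points that split the range evenly."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / n_classes
    return [lo + step * i for i in range(1, n_classes)]


def classify(value, breaks):
    """Return the class index (0 = lowest) for a value."""
    return bisect.bisect_right(breaks, value)


# Hypothetical unemployment rates (percent), for illustration only.
rates = [3.0, 5.0, 7.0, 9.0, 11.0]

breaks = equal_interval_breaks(rates, n_classes=4)
classes = [classify(rate, breaks) for rate in rates]

print(breaks)
print(classes)
```

Changing `n_classes` is all it takes to try a scale with more categories and see whether the patterns become easier to read.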

The range and structure of the data suggest that there is a good story to be found looking at unemployment before, during and after Obama’s first term. There were certainly some unusual statistics associated with the 2007-2009 recession (as defined by the National Bureau of Economic Research). It was the worst period of economic performance in the U.S. since the Great Depression and the pace of the recovery is one of the slowest on record.

In fact, until President Obama was re-elected a few weeks ago, no sitting president since World War II had been returned to office with an unemployment rate above 7.2%. This metric was such a sacred cow that conservative pundits accused the BLS of bias when data more favorable to the President was released in the run-up to the election. So, how did Obama earn a second term fighting these headwinds?

My first set of charts presents an overview of unemployment in the U.S. over the past twelve years. I wanted to show both the long-term trend in unemployment as well as a side-by-side comparison of the three most recent presidential terms. I’ve included a shaded area for each of the past two recessions on the first chart to show their effect.

The first thing I noticed by looking at these charts is that, over the past twelve years, the U.S. unemployment rate has never been lower than it was during George W. Bush’s first month in office. The rate got pretty close to that mark in the final months of Bush’s second term but it never quite made it. The second thing I noticed was that the drop in unemployment during the months following the Great Recession was slightly faster than it was during the recovery period following the 2001 recession.

My second chart shows the unemployment rate for each state over the course of Obama’s first term. It also includes a ranking of states by total unemployment and colors each chart using the results of the 2012 election.

Infographics and Data Visualization (Week 3)

The goal of this week’s assignment is to review some global aid data from the Guardian and evaluate how this information should be presented. This is a two-part assignment and I have been able to download the data and let my thoughts percolate over the past few days. The focus is on the aid transparency index, which uses a broad set of criteria to rank major aid donors on their openness.

I’ll have to admit that my first reaction after looking at the data a bit was a muted “so what?” A simple rank of the aid organizations shows some of the usual good samaritans at the top and an apparent decline in transparency that roughly corresponds to a drop in GDP per capita (or possibly happiness or density of heavy metal bands).

Part of my lukewarm response stems from the fact that I don’t really know what the consequences of transparency (or lack of transparency) are. Is there a concern about influence? Bribery? Funding of criminal or terrorist organizations? The U.S. aid organizations are kind of in the middle of the pack, which I suppose is not ideal. However, the U.S. list includes the Department of Defense, which I wouldn’t necessarily expect to be that open given the particular nature of its mission.

Other questions that come to mind include:

  • What criteria are used to pick the organizations in this list? Who’s missing?
  • Do other military organizations make the list?
  • How is aid defined?
  • Why are some countries’ scores aggregated while others are listed separately by organization?

Some of these answers can be found in the primary report, which suggests that the goal of aid transparency is to allow for effective policy planning and decision-making. The report states:

For aid to be more effective it needs to be more predictable, coordinated between donors, managed for results, and aligned to recipient countries’ own plans and systems. To achieve this, the information has to be shared between all parties involved in the delivery of aid in a timely, comprehensive and comparable way. Without this information it is not possible to know what is being spent where, by whom and with what results.

This makes sense … but I don’t know if I would normally associate this goal with “transparency.” To me, transparency has more to do with promoting accountability and providing information to citizens about what their Government is doing. The aid Index seems to be more about project coordination, efficiency and data governance. (Later on in the report, the text does mention that citizens will want to know where their money is going … more of a traditional goal of transparency.)

One of the major tools in the push for transparency is the development of a common standard for publishing aid information through the International Aid Transparency Initiative (IATI). The IATI registry has improved the quality and transparency of aid information, particularly for organizations that have either automated their publication or have already begun to address gaps and inconsistencies.

So, is there a story in the development and adoption of this standard? The report itself suggests that the purpose of the Index is in flux and asks whether a simpler methodology could still achieve the goal of providing effective, efficient and accountable aid information.

As I thought about this chart, I decided that any overview should show both the total transparency score and some measure of improvement from the previous year (there is both a 2011 and 2012 score). I decided on a scatterplot with the total score on the horizontal axis and the change in score (a ratio or percent) from 2011 to 2012 on the vertical axis. Along the right side I also thought I’d include a regular bar chart sorted by score.
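The arithmetic behind the scatterplot is simple enough to sketch in Python. The donor names and scores below are hypothetical placeholders, not values from the Index, and I’ve assumed the percent version of the change measure.

```python
def percent_change(old, new):
    """Percent change from old to new; undefined when old is zero."""
    if old == 0:
        raise ValueError("percent change is undefined for a zero starting score")
    return (new - old) / old * 100.0


# Hypothetical (2011 score, 2012 score) pairs, for illustration only.
scores = {
    "Donor A": (40.0, 50.0),
    "Donor B": (80.0, 90.0),
    "Donor C": (20.0, 15.0),
}

# Scatterplot coordinates: x = 2012 total score, y = percent change from 2011.
points = {
    name: (s2012, percent_change(s2011, s2012))
    for name, (s2011, s2012) in scores.items()
}

# Bar chart order: donors ranked by 2012 score, highest first.
ranking = sorted(scores, key=lambda name: scores[name][1], reverse=True)

print(points)
print(ranking)
```

In this made-up example, Donor A ranks below Donor B on total score but shows twice the improvement, which is exactly the kind of contrast the scatterplot is meant to surface.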

A static sketch of this first chart:

I like the way the scatterplot emphasizes both the overall score and the year-over-year improvement. This shows organizations that have made progress toward the ultimate goal of transparency but may not have reached the heights of a group like the World Bank. The bar chart on the right shows the standard ranking.

From this chart, the user should be able to navigate to details for each agency. I’d like to see comparisons of each sub-level (agency, organization, country) as well as the individual survey questions. There’s a pretty interesting chart toward the end of the report that shows the responses to all questions for all agencies as colored dots. It is intriguing and might offer some direction to these detailed charts. Otherwise it may be worth exploring standard charts.

Infographics and Data Visualization (Week 2)

For week two of the course, we’ve been asked to take a look at this interactive graphic from the New York Times, which compares the different words that Democratic and Republican speakers used during their respective conventions.

Overall, I thought that the graphic was pretty good but there were a few things that I might consider redesigning. The first problem I noticed was that, when you click on the word bubbles, the political quotes below the chart change based on your selection. Unfortunately, most of this interactivity occurs “below the fold” or off-screen so you don’t necessarily see it right away. I would need to be presented with more cues to know that this was going on. It seems like tightening up the top part of the chart and shrinking some of the ad space or menu heights might help here.

It also took me a while to figure out that you could type in your own words and add them to the graphic. This feature is pretty cool but I don’t think it is necessarily obvious to first time visitors. I liked how the new word bubbles kind of migrated around to find a spot in the crowd but they sometimes got stuck in the middle of the pack if the words around them were too big.

The bubble sizes are difficult to interpret directly but I don’t think that is necessary for this graphic. I do have a problem with the way the bubbles indicate the % of word usage by political party. I would expect either a pie chart with the % in a slice or maybe a color difference along a spectrum (blue to red).

My first redesign attempt:

Although this “sketch” is not interactive, you can kind of see where I was headed. The first issue I tackled was trying to make it more obvious that the individual words or phrases could be shown in context. I did this by moving the quotes up from the bottom and placing them in cartoon speech bubbles along the sides of the graphic. The directional arrow for each speech bubble points to the word being examined and also indicates a slider that can be moved up and down from word to word. The speech bubbles could expand to include multiple quotes or maybe there could be some other form of gallery navigation within the bubble itself.

The individual words are displayed in a standard bar chart that clearly shows the word itself but doesn’t play with the font size at all. I let all comparisons between the words be shown using the red and blue bars, with relative usage rates represented by the length of the bars. This allows direct comparison of usage rates between the two parties as well as relative comparison between words.

I imagined that typing a word or phrase in the box would add that word or phrase to the top of the “stack” of bar charts, moving the rest of the words down one slot. This way the user could add as many words as they want and scroll down the length of the chart to look at their entire list and make comparisons.

Despite these adjustments, it’s still hard to see how the average user would pull a compelling narrative out of this presentation without some assistance. To me, the story of this graphic is about the language that the different parties use to craft their messages. The use of certain words over others reflects each party’s priorities and their understanding of the intended audience.

Since we know word choice is designed to influence the audience in some way, it might be interesting to include examples of how the two parties have used language in the past. On the Republican side, Newt Gingrich’s 1994 memo to the GOPAC titled “Language: A Key Mechanism of Control” is a famous example. It contains a list of “optimistic positive governing words” that Gingrich recommended for use in describing Republican politicians and “contrasting words” that he suggested using to describe Democrats.

On the other side of the aisle, people like George Lakoff and Elisabeth Wehling at The Little Blue Blog use concepts like “frames” to describe how the use of particular words triggers associations with either conservative or progressive moral systems. (Another interesting look at the use of language in politics can be found at Sasha Issenberg’s Victory Lab site.)

Either of these resources might be a good starting point for an analysis of word usage by politicians. In fact, one member of the class posted a quick graphic using Gingrich’s positive words here and I found it fascinating that the top three positive words used by Democrats (fair, building and reform) demonstrated a far different focus than those used by Republicans (liberty, freedom and lead).

Modifying the NYT graphic to accommodate these investigations might involve the addition of “starter lists” of words such as the top 10 words for each party by word count, top 10 words by uniqueness to each party, or Gingrich’s positive word list. I also like the idea of a word association feature which could suggest related topics via a word cloud or a “you might also try this word” feature.
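As a rough illustration of how two of these starter lists might be built, here is a short Python sketch. The word samples are invented for the example and are not convention transcripts, and “uniqueness” is my own assumption: the share of a word’s total uses that come from one party.

```python
from collections import Counter

# Invented word samples, for illustration only (not real convention data).
dem_words = "fair jobs jobs reform fair fair families".split()
rep_words = "liberty freedom jobs lead liberty business".split()

dem_counts = Counter(dem_words)
rep_counts = Counter(rep_words)


def top_by_count(counts, n):
    """The n most frequently used words."""
    return [word for word, _ in counts.most_common(n)]


def top_by_uniqueness(own, other, n):
    """The n words whose usage is most concentrated in one party.

    Ties are broken by raw count, then alphabetically.
    """
    def share(word):
        return own[word] / (own[word] + other[word])

    return sorted(own, key=lambda w: (-share(w), -own[w], w))[:n]


print(top_by_count(dem_counts, 2))
print(top_by_uniqueness(dem_counts, rep_counts, 2))
print(top_by_uniqueness(rep_counts, dem_counts, 1))
```

Notice that “jobs” makes the count list for the first sample but drops out of the uniqueness list because both samples use it, which is the difference between the two kinds of starter list.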

Infographics and Data Visualization (Week 1)

The Introduction to Infographics and Data Visualization course begins Sunday so I’m starting to receive emails from the instructor. The first thing I need to do is tackle the reading list and then take a look at the first assignment, which involves the review of this graphic, which was based on a survey of 32,000 Internet users from 16 different countries. The survey asked these users about the kind of online services they used on a regular basis.

The online class discussion was pretty good and very thorough. My own thoughts began with the graphic “building block” that the designer used to organize and convey information. This consisted of a nested group of overlapping doughnut charts that used color, size and fractional divisions to represent the data for each country (see below).

I think that the arcs of the doughnut are meant to be interpreted in two dimensions: 1) the sweep of the arc represents the % of the category population that is engaged in the activity (similar to a regular pie chart) and 2) the radius from the center represents the overall size of the category population (similar to a regular bubble chart). Both pie charts and bubble charts can work in certain circumstances but they make direct comparisons difficult. Throw in the fact that the arcs overlap and it is almost impossible to understand the meaning associated with different variables. For example, the predominant color in graphs for countries like the U.S. or Canada is pink, which downplays the larger population of social profile users.

My first instinct for adjusting this infographic was to “unpack” the doughnut chart and place the data in a regular bar chart. By using standard bars, it is fairly easy to make comparisons between the different categories. The bar chart also shows percentages naturally if I include a gray bar that represents the total population of internet users. (The value of the gray bar is an assumption on my part, calculated by dividing the user value by the access percentage. This works for almost every country except the U.S. and the U.K.)
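The gray bar calculation looks something like the Python sketch below. The numbers are hypothetical, not values from the survey, and the function name is my own.

```python
def estimate_total_users(category_users, access_percent):
    """Back out the total Internet population from one category.

    category_users: number of people using a service (any unit, e.g. millions)
    access_percent: percent of Internet users who use that service (0-100)
    """
    if access_percent <= 0:
        raise ValueError("access_percent must be greater than zero")
    return category_users * 100.0 / access_percent


# Hypothetical example: 30 million users of a service that reaches 60% of
# a country's Internet population implies about 50 million Internet users.
total = estimate_total_users(30.0, 60.0)
print(total)
```

If the data is consistent, every category in a country should point to roughly the same total, so a country where the estimates disagree is one where this assumption breaks down.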

The real power of this approach comes with side-by-side comparisons of the data. After swapping the axes and adding in the other countries, the resulting chart allows for relatively easy comparison of both overall Internet usage and individual social media involvement. Both the U.S. and U.K. totals are fudged.

One problem I have with this chart is the huge amount of white space in the upper right quadrant. This is caused by the great disparity in size between the Internet populations of the largest and smallest countries. Adjustments like the use of logarithmic scales or scatterplots might be able to fill out the canvas a bit but they also make direct comparisons more difficult. I’m also not too sure about the color scheme, which I find somewhat distracting.

Tackling both of these issues at once, I’ve removed the separate colors for the social media categories and added in an overlay that uses a radar chart to show the relative differences between social media usage within countries.

The radar charts are kind of fun and they make it pretty easy to see different patterns of Internet usage among the 16 countries. The higher social profile participation (and lower blog usage) of Western countries creates a distinctive shape when compared to Asian countries like Japan and South Korea. The two-color scheme also makes it easier to see patterns in the column charts. However, I’m not sure that depending on the order of the columns is enough to compare social categories across countries.

I’m going to let my solution stand for now. Meanwhile, here are some other solutions from the class and around the web:

Infographics and Data Visualization (Sign Up)

Despite a crazy schedule, I’ve decided to sign up for a free online course offered by the Knight Center for Journalism called Introduction to Infographics and Data Visualization. It runs from October 28 – December 8 and will be taught by Alberto Cairo, the author of The Functional Art: An Introduction to Information Graphics and Visualization, published by Peachpit Press. I will be sure to post my completed assignments here. It should be fun!