Data Literacy 101: What is Data?

Whenever the topic of data comes up in meetings or informal conversations, it doesn’t take long for people’s eyes to glaze over. The subject is usually considered so complex and esoteric that only a few technically minded geeks find value in the details. This easy dismissal of data is a real problem in the modern business world because so much of what we know about customers and products is codified as information and stored in corporate databases. Without a high level of data literacy, this information sits idle and unused.

One way I try to get people more interested in data is to make a distinction between data management and data content. In its broadest sense, data management consists of all the technical equipment, expertise, security procedures, and quality control measures that go into riding herd on large volumes of data. Data content, on the other hand, is all the fun stuff that is housed and made accessible by this infrastructure. To put it another way, think of data management as a twisty, mountain road built by skilled engineers and laborers while data content is the Ferrari you get to drive on it.

Okay, maybe that’s taking it a bit too far. Stick with me.

At its most basic, data is simply something you want to remember (a concept I borrowed from an article by Rob Karel). Examples might include:

  • Your home address
  • Your mom’s birthday
  • Your computer password
  • A friend’s phone number
  • Your daughter’s favorite color

You could simply memorize this information, of course, but human memory is fragile and so we often collect personally meaningful information and store it in “tools” like calendars, address books, spreadsheets, databases, or even paper lists. Although this last item might not seem like a robust data storage method it is a good introduction to some basic data concepts. (I’ve talked about the appeal of “Top 10” lists as a communication tool in a previous post but I didn’t really address their specific structure.)

Let’s start with a simple grocery list:

Data101_List_1

Believe it or not, this is data. A list like this has a very loose data structure consisting of related items separated by some sort of “delimiter” like a comma or — in this case — a new line or row on our fake note pad. You can add or subtract items from the list, count the total number of items, group items into categories (like “dairy” or “bakery”), or sort items by some sequence. Many of you will have created similar lists because they are great external memory aids.
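For the curious, here is a minimal Python sketch of those list operations. The four items and the category labels are my own assumptions for illustration, not a transcription of the note pad above.

```python
# Hypothetical grocery list stored as newline-delimited text,
# like the rows on a note pad.
raw_list = "eggs\nmilk\nbread\npeanut butter"

# Splitting on the delimiter turns the text into individual items.
items = raw_list.split("\n")

# Add and remove items.
items.append("butter")
items.remove("bread")

# Count the total number of items.
total = len(items)

# Sort items alphabetically.
sorted_items = sorted(items)

# Group items into categories (this mapping is an assumption).
categories = {
    "eggs": "dairy",
    "milk": "dairy",
    "butter": "dairy",
    "peanut butter": "pantry",
}
grouped = {}
for item in items:
    grouped.setdefault(categories.get(item, "other"), []).append(item)

print(total)
print(sorted_items)
print(grouped)
```

Nothing here is more sophisticated than the paper list. The delimiter is what lets a program tell one item from the next.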

The problem with this list is that it is very generalized. You could give this grocery list to ten different people and get ten different results. How many eggs do you want? Do you want whole milk, 2%, or fat free? What type of bread do you want? What brand of peanut butter do you like?

This list really only works for you because a memory aid works in concert with your own personal circumstances. If someone doesn’t share that context then the content itself doesn’t translate very well. That’s okay for “to do” lists or solo trips to the grocery store but doesn’t work for a system that will be used by multiple people (like a business). In order to overcome this barrier you have to add specificity to your initial list.

Data101_List_2

This is a grocery list that I might hand over to my teenage son. It is more specific than the first list and has exact amounts and other additional details that he will need to get the order right. Notice, however, that there is a cost for this increased level of specificity, with the second list containing over four times as many characters as the first one. At the same time, this list still lacks key attributes that would help clarify the request for non-family members.

If we are going to make this list more useful to others, we need to continue to improve its specificity while making it more versatile. One way to do this is to start thinking about how we would merge several grocery lists together.

Data101_List_3

Here is our original list stacked on top of a second list of similar items. I’ve added brand names to both of them and included a heading above each list with the name of the list’s owner. The data itself is still “unstructured”, however, meaning it is not organized in any particular way. This lack of structure doesn’t necessarily interfere with our goal of buying groceries but it does limit our ability to organize items or find meaningful patterns in the data. As our list grows this problem is compounded. Eventually, we’ll need to find some way of introducing structure to our lists.

Data101_List_4

One step we can take is to break up our list entries and put the individual pieces into a table. A table is an arrangement of rows and columns where each row represents a unique item, while each column or “field” contains elements of the same data “type.” For this first example, I’ve created three columns: a place for a “customer” name (the text of the list’s owner), an item count (a number), and the item itself (more text). Notice that the two lists are truly merged, allowing us to sort items if we want.
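A rough Python sketch of the same idea, assuming made-up customers and items (the `Row` type and every value in it are hypothetical):

```python
from typing import NamedTuple


class Row(NamedTuple):
    customer: str  # text: the list's owner
    count: int     # number: how many to buy
    item: str      # text: the item description


# Two hypothetical lists that share the same three columns.
first_list = [
    Row("Customer A", 1, "dozen large eggs"),
    Row("Customer A", 2, "loaves wheat bread"),
]
second_list = [
    Row("Customer B", 1, "gallon skim milk"),
    Row("Customer B", 1, "jar peanut butter"),
]

# Merging is just stacking rows, because the columns line up.
table = first_list + second_list

# Every value in the count column is a number.
counts_are_numbers = all(isinstance(row.count, int) for row in table)

# Once merged, the whole table can be sorted on any field.
by_item = sorted(table, key=lambda row: row.item)
for row in by_item:
    print(f"{row.customer:<12}{row.count:>3}  {row.item}")
```

The payoff of consistent fields is that merging and sorting need no special handling for either list.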

Data101_List_4_sorted

Sorting makes it a bit easier to pick out similar items, which will help a little on our fictitious shopping trip. However, we still have a problem. Some of the items (like the milk, butter, and peanut butter) are sorted by the size criteria listed in the unstructured text, which makes it harder to see that some of these things can be found in the same aisle. Adding new fields will help with this.
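The sorting problem is easy to reproduce. In this small sketch the item descriptions are invented, with the size written first as free text:

```python
# Hypothetical item descriptions where the size leads the text.
items = [
    "quart skim milk",
    "loaf wheat bread",
    "gallon 2% milk",
    "dozen eggs",
]

# A plain text sort orders by the leading size word, not the product.
sorted_items = sorted(items)
print(sorted_items)

# Find where the milk entries landed after sorting.
milk_positions = [i for i, text in enumerate(sorted_items) if "milk" in text]
print(milk_positions)
```

The two milk entries end up on either side of the bread because "gallon" and "quart" drive the sort, not "milk".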

Data101_List_5_sorted

By adding separate columns for brand name and size, the data in the “item” column is actually pretty close to our first list. All the additional details are included in new fields that are clearly defined and contain similar data. We’ve had to clean up a few labeling issues (such as “skim milk” vs. “fat free milk”) but these are relatively minor data governance issues. Our final, summarized list is ready for prime time.
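Here is one way these last steps might look in Python. The rows, brand names, and the `aliases` mapping are all assumptions for illustration, and the summary is just one possible roll-up (total units per item), not necessarily the one pictured.

```python
from collections import Counter

# Hypothetical rows: (customer, count, item, brand, size).
rows = [
    ("Customer A", 1, "skim milk", "Brand X", "gallon"),
    ("Customer B", 2, "fat free milk", "Brand Y", "quart"),
    ("Customer A", 1, "peanut butter", "Brand Z", "16 oz"),
    ("Customer B", 1, "eggs", "Brand X", "dozen"),
]

# Map alternate labels onto one agreed name (a small governance rule).
aliases = {"fat free milk": "skim milk"}

cleaned = [
    (customer, count, aliases.get(item, item), brand, size)
    for customer, count, item, brand, size in rows
]

# Sorting on the item field now keeps like products together.
cleaned.sort(key=lambda row: row[2])

# Summarize: total units needed per item across both customers.
summary = Counter()
for _customer, count, item, _brand, _size in cleaned:
    summary[item] += count

print(dict(summary))
```

With size and brand in their own fields, the item column carries only the product name, so the relabeled milk rows sort and total together.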

Data101_List_6_Summary

And that, my friend, is how data is made.
