
Spelling and National Security

A former co-worker of mine always used to joke about our company’s customer database by posing the deceptively simple question: “How many ways can you spell ‘IBM’?” In fact, the number of unique entries for that particular client was in the dozens. Here is a sample of possible iterations, with abbreviations alone counting for several of them:

  • IBM
  • I B M
  • I.B.M.
  • I. B. M.
  • IBM CORP
  • IBM CORPORATION
  • INTL BUS MACHINES
  • INTERNATION BUSINESS MACHINES
  • INTERNATIONAL BUSINESS MACHINES
  • INTERNATIONAL BUSINESS MA

I thought of this anecdote recently while I was reading an article about the government’s Terrorist Identities Datamart Environment list (TIDE), an attempt to consolidate the terrorist watch lists of various intelligence organizations (CIA, FBI, NSA, etc.) into a single, centralized database. TIDE was coming under scrutiny because it had failed to flag Tamerlan Tsarnaev (the elder suspect in the Boston Marathon bombings) as a threat when he re-entered the country in July 2012 after a six-month trip to Russia. It turns out that Tsarnaev’s TIDE entry didn’t register with U.S. customs officials because his name was misspelled and his date of birth was incorrect.

These types of data entry errors are incredibly common. I keep a running list of direct marketers’ misspellings of my own last name, and it currently stands at 22 variations. In the data world, these variations can be described by their “edit distance” or Levenshtein distance — the number of single-character substitutions, deletions, or insertions required to correct the entry. (A quick sketch of the calculation follows the list of variations below.)

  • Actual name: Kinde
  • Phonetic misspellings: Kindy, Kindee, Kindle, Kindde, Kindke, Kindl, Kinds, Kinge, Kinele, Winde, Kinae, Kincius, Jindy
  • Dropped letters: Kine, Inde
  • Inserted letters: Kiinde, Kinder
  • Converted letters: Kinoe, Kimbe, Kimde, Isinde, Pindy
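
For the technically curious, here is a rough sketch of what that edit-distance calculation looks like: just the textbook dynamic-programming approach written out in Python, not anything pulled from a real matching system. The sample comparisons at the bottom use misspellings from the list above.

```python
# A minimal sketch of the Levenshtein (edit) distance described above,
# computed with standard dynamic programming.
def levenshtein(a: str, b: str) -> int:
    """Number of single-character substitutions, deletions, or insertions
    needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,         # deletion
                               current[j - 1] + 1,      # insertion
                               previous[j - 1] + cost)) # substitution
        previous = current
    return previous[-1]

print(levenshtein("KINDE", "KINDLE"))  # distance 1: one inserted letter
print(levenshtein("KINDE", "PINDY"))   # distance 2: two converted letters
```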

Many of these typographical mistakes are the result of my own poor handwriting, which I admit can be difficult to transcribe. However, if marketers have this much trouble with a basic, five-letter last name, you can imagine the problems the feds might have with a longer foreign name with extra vowels, umlauts, accents, and other flourishes thrown in for good measure. Add in a first name and a middle initial and the list of possible permutations grows quite large … and this doesn’t even begin to address the issue of people with the same or similar names. (My own sister gets pulled out of airport security lines on a regular basis because her name doppelgänger has caught the attention of the feds.)

The standard solutions for these types of problems typically involve techniques like fuzzy matching algorithms and other programmatic methods for eliminating duplicates and automatically merging associated records. The problem with this approach is that it either ignores or downplays the human element in developing and maintaining such databases.
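
To make the idea a little more concrete, here is a toy sketch of fuzzy matching using Python’s standard-library difflib (real deduplication tools are far more elaborate than this). Pairs like “IBM CORP” and “IBM CORPORATION” score high, while “IBM” and the fully spelled-out name look almost unrelated to a naive string comparison, which is exactly the kind of gap where the human element comes in.

```python
# Toy fuzzy matching over the "IBM" variants from the list above.
# A real system would merge pairs scoring above some tuned threshold.
from difflib import SequenceMatcher
from itertools import combinations

entries = ["IBM", "I.B.M.", "IBM CORP", "IBM CORPORATION",
           "INTL BUS MACHINES", "INTERNATIONAL BUSINESS MACHINES"]

def similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0 and 1 (1.0 = identical strings)."""
    return SequenceMatcher(None, a, b).ratio()

# Score every pair of entries.
for a, b in combinations(entries, 2):
    print(f"{a:18s} vs {b:32s} {similarity(a, b):.2f}")
```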

My personal experience suggests that most people view data and databases as an advanced technological domain that is the exclusive purview of programmers, developers, and other IT professionals. In reality, the “high tech” aspect of data is limited to its storage and manipulation. The actual content of databases — the data itself — is most decidedly low tech … words and numbers. By focusing popular attention almost exclusively on the machinery and software involved in data processing, we miss the points in the data life-cycle where most errors start to creep in: the people who enter information and the people who interpret it.

I once worked at a company where we introduced a crude quality check to a manual double-entry process. If two pieces of information didn’t match, the program paused to let the person correct their mistake. The data entry folk were incensed! The automatic checks were bogging down the process and hurting their productivity. Never mind that the quality of the data had improved … what really mattered was speed!

On the other hand, I’ve also seen situations where perfectly capable people had difficulty pulling basic trends from their Business Intelligence (BI) software. The reporting deployments were so intimidating that people would often end up moving their data over to a copy of Microsoft Excel so they could work with a more familiar tool.

In both cases, the problem wasn’t the technology per se, but the way in which humans interacted with the technology. People make mistakes and take shortcuts … it is a natural part of our creativity and self-expression. We’re just not cut out to follow the exacting standards of some of these computerized environments.

In the case of databases like TIDE, as long as the focus remains on finding technical solutions to data problems, we miss out on what I think is the real opportunity — human solutions that focus on usability, making intuitive connections, and the ease of interpretation.

Update:

  • July 7, 2013 – In a similar database failure, Interpol refused to issue a worldwide “Red Notice” for Edward Snowden recently because the U.S. paperwork didn’t include his passport number and listed his middle name incorrectly.
  • January 2, 2014 – For a great article on fuzzy matching, check out the following: http://marketing.profisee.com/acton/attachment/2329/f-015e/1/-/-/-/-/file.pdf.

2012: The Year of the Amateur

One of the topics that seemed to keep cropping up in the news this year was the growing power of the amateur in public life. This trend is not necessarily new, but it has been gaining momentum as modern technologies make it easier for the average person to create things (e.g. books, music, videos, or physical products) and deliver them to a wider audience. Combine this with an anemic economic recovery and you have the perfect environment for people striking out on their own.

American history is full of passionate amateurs who ignored societal rules or overcame an entrenched bureaucracy to introduce new and exciting ideas to our culture. We admire the business entrepreneur, the garage band, and the inventor working out of his basement. They are some of our most cherished icons and they speak to our desire to make it big on our own terms. This attitude finds its purest expression in the Do-It-Yourself (DIY) ethic, which encourages individuals to bypass specialists altogether and seek out knowledge and expertise on their own.

There are some problems with this relentless individualism, however. Taken to the extreme, this skeptical attitude toward the professional “elite” can lead to the distrust — and perhaps even disdain — of true experts. People now diagnose their own medical conditions, create their own legal documents, homeschool their own children, and regularly deny the validity of scientifically accepted facts. In an article that discusses recent changes in the distribution of information, Larry Sanger talks about how the aggregation of public opinion on the Internet (what he calls the “politics of knowledge”) has eroded our understanding of, and respect for, reliable information:

“With the rejection of professionalism has come a widespread rejection of expertise — of the proper role in society of people who make it their life’s work to know stuff.”

Everybody’s an expert now, in the sense that we can all do our own research online and come to our own conclusions about any topic under the sun. It’s the perfect democratization of knowledge … except most of us aren’t really experts in the traditional sense. Experts typically possess a very deep understanding of a subject and are aware of its subtleties and nuances. The average person may only scratch the surface of a topic and can miss important details because they literally don’t know what they don’t know. Nobody’s seriously going to call in an amateur cardiac surgeon if they’ve got a heart problem, so why is it so easy to dismiss the work of professionals in other fields?

Before I’m accused of being elitist, let’s lay down a framework for discussing the differences between amateurs, experts, and professionals. In an article published by Wharton, Kendall Whitehouse draws the distinction between “knowledgeable enthusiasts” (amateurs) and professionals based on the editorial process (this is in a journalistic context):

“Carefully checked sources and consistent editorial guidelines are key differences between most professional and amateur content … The latter brings quickness and a personal viewpoint and the former provides analysis and consistent quality.”

While I certainly agree that results are important, there are plenty of situations where amateurs deliver results that are as good as those of professionals. In fact, the DIY community frequently uses the term “amateur expert” and notes that the word “amateur” stands in contrast to the commercial motivation (i.e. financial reward) of the professional, not their level of skill. Following this reasoning, a professional is not necessarily an expert; they are simply someone who happens to get paid for what they do. An amateur can still be an expert based on their skills and abilities; they just don’t get paid.

If the amateur/professional word pairing makes sense, we still need an antonym of “expert” to refer to deficiencies in skills. In this case, I would suggest the term “novice,” which is defined as someone who has very little training or experience. Essentially this means that a thorough discussion of experts and amateurs needs to account for both a financial dimension (amateur vs. professional) and a skill or experience dimension (novice vs. expert). I’ve created a quick quad chart to visualize these relationships:

If we return to our previous discussion, we can now see that the rejection of expertise does not necessarily represent support for the plucky amateur, it represents a shift toward glorification of the naive. Sure, there are times when novices can bring a fresh perspective to established practices (punk rockers and other creative outsiders come to mind). But in 2012, the growing regularity of this superficial approach led to a few very interesting — and very public — failures.

The first example is the unauthorized attempt by an elderly parishioner to restore a painting in a Spanish church over the summer. The tragi-comic results of Cecilia Gimenez’s fresco fiasco were all over the news in August and it was pretty clear to everyone that her work was a massive failure. Using our new definition, she is clearly a novice (unskilled) amateur (unpaid).

Ms. Gimenez later complained that, with all the attention that her botched restoration of Ecce Homo had gotten, she should have received some compensation for her work. This would have made her a quasi-professional, I guess, but I don’t suppose there are a lot of museums out there who’d be willing to hand over their cultural treasures to her care.

(To create your own Ecce Homo restoration, check out this site.)

The second example was the National Football League’s use of replacement referees during the early part of the 2012 season. With the regular officials locked out due to contract negotiations, NFL management brought in referees from semi-professional football leagues, lower college divisions, and even high schools in hopes that nobody would notice the difference. They noticed.

Throughout the preseason, a series of bad penalties, missed calls, and even blown coin tosses made it clear that the new guys were not ready for prime time. As the regular season progressed and the mistakes accumulated, demands for the return of the regular refs grew louder. Finally, two days after the outcome of a game between the Green Bay Packers and the Seattle Seahawks was decided by a controversial call, an agreement between the NFL and NFL Referees Association was reached. (Photo below from the Washington Post.)

NFL management clearly misjudged the level of skill needed to officiate a pro football game and how quickly the replacement refs would be exposed for what they were: novice professionals. This isn’t to say that some of these guys couldn’t have developed into perfectly good officials over time. But such a high-profile occupation doesn’t really lend itself to on-the-job training.

Not all skilled workers are lucky enough to have their expertise hit the bottom line so obviously. Writing in an article about the NFL lockout, Paul Weber noted:

“Attitudes about expertise can … make it a risky hand to play in a negotiation, depending on who’s on the other side of the table. The idea that no one is irreplaceable and there’s always a guy next in line willing to do the job run deep in America. Professing expertise can also bring on suspicions of elitism and scratch an itch to knock someone down a peg.”

This inclination can be seen clearly in my third example of the year, which involves several high-profile political pundits who insisted that Mitt Romney would win the 2012 Presidential election. When statistician Nate Silver of the New York Times began predicting an Obama victory back in June, many conservative commentators questioned both his methodology and his masculinity (offending comments have since been removed).

Despite Silver’s clear statements regarding the laws of probability, conservatives just could not get past the fact that most of their favored polls (University of Colorado, Rasmussen) showed a neck-and-neck race. In the end, the elections validated the statistical approach that Silver used and forced many people to rethink their reliance on ‘unskewed’ polls or Karl Rove’s math skills.

Although the animosity toward Silver subsided after the election, I have my doubts that his success will lead to a sudden surge in respect for professional experts. There seems to be a natural tendency in our culture to distrust anyone who stakes a claim to the truth — especially if we don’t like what they’re saying.

The most vociferous of these amateur-versus-professional battles are fought between journalists and bloggers, but there are plenty of other pairings that set off fireworks. In a recent book review on Slate, professional writer Doree Shafrir openly wonders why anyone would be satisfied with being an amateur. To her, the only path to gratification and validation is through professional success:

“The idea of being an office drone by day and by night being, say, an amateur astronomer is completely bizarre to me. Why wouldn’t you just be an astronomer?”

To which a wise reader responds:

“The sad fact is that many of us simply aren’t good enough at what we really love to do it for a living … Or we were good, but unlucky. Or unwilling to sacrifice our families. Or we’re still living down the consequences of a previous failure.”

Amateur interests are a way for someone to gain new skills, test drive a new career, or just participate in a community despite the fact that they aren’t collecting a paycheck. The amateur/professional spectrum doesn’t just exist at the endpoints; it runs the gamut from hobbyists and tinkerers to semi-professionals and professionals. Back in 2004, a report titled The Pro-Am Revolution by Charles Leadbeater (a frequent contributor to TED) suggested that improved tools and new methods of collaboration are helping to create a breed of amateurs who hold themselves to professional standards and can even produce significant discoveries.

In the field of astronomy, these “demi-experts” had an amazing year. Recent developments in computer technology and digital imaging have allowed amateur astronomers to explore regions of the universe never before seen by non-scientists. Plus, the sky is so vast (and observation time so restricted) that serious amateurs can help professional astronomers simply by observing unrecorded (or underrecorded) stellar objects. Significant amateur finds in 2012 included: new comets; new exoplanets; explosions on Jupiter; a planet with four suns; a detailed map of Ganymede; mysterious clouds on Mars; and even previously undiscovered photos from the Hubble telescope.

While these examples make it clear that amateurs can contribute meaningfully to many fields, it is less obvious how society can avoid the pitfalls associated with the well-intended novice. The key, I think, is for everyone — from novice to expert, amateur to professional — to recognize their own limitations. Businesses want expertise but they don’t always want to pay for it. People want to do what they love but they don’t always have the time or skills to make it their career. A novice who tries to recreate the work of an expert will almost certainly fail but an amateur with passion and drive can spur innovations beyond the abilities of entrenched professionals.

These labels are fluid. All experts were once beginners and all professionals were once unpaid. People progress from novice to expert in distinct stages but they can also move from expert to novice if they change careers. In today’s job market, it even seems possible that some of us could apply all of these labels to ourselves at once. To paraphrase author Richard Bach, a professional is simply an amateur who didn’t quit.


Infographics and Data Visualization (Week 5/6)

I took part in a brief discussion on the student forum after the Week 4 project and it made me realize that I’d been spending so much time trying to create a functional interactive graphic in Tableau that I was missing out on practicing some of the basic techniques of the class. When you combine that with the fact that my favorite attempt was a sketch I laid out in PowerPoint, I decided that I should try to focus on the structure and design of the graphic to see what I could come up with.

The topic I picked was based on some data that I’d pulled back in May/June that I’d never had a chance to use. This data covered all of the various U.S. breweries and the variety of beers they made. I did some additional research to add some information on beer ingredients (especially water, barley and hops) as well as some interesting stats on beer consumption based on a few fun maps done at FloatingSheep.

I spent a good deal of time coming up with the basic grid of the graphic, which ended up having a static left hand column for the introduction to each topic and then an interactive map of the U.S. on the right. The interactive portion consists of tabbed sections that allow you to navigate through several subtopics.

The flow of the series starts with an overview of beer production in the U.S., moves to a section on the ingredients of beer, and ends with information on American beer consumption. (I also thought about including some local beer stats for the great State of Wisconsin but that may have to wait.)

Due to time constraints, these mockups contain sample maps from other sources (here and here):

Infographics and Data Visualization (Week 4)

The assignment for Week 4 is based on data used in a recent Guardian article on U.S. unemployment. Having used Bureau of Labor Statistics (BLS) data for many years at my previous job, I am far more familiar with this topic than I was with the data we used for last week’s assignment. In fact, I have already written several blog posts dealing with general employment statistics, so it will be challenging to come up with something fresh.

The Guardian article includes an interactive map that highlights the lower 48 states (Hawaii and Alaska are off screen) and allows the user to select one of eight different employment metrics. A five-color scale defines the range of each metric while clicking on an individual state brings up a bar chart displaying a few data points and some additional text.

One problem I have with this map is that I think the states are too large to tell a detailed story about how unemployment affects different areas of the country. Maps at the county level (like this one from the BLS or this gorgeous D3 example posted on GitHub) show far more interesting regional employment patterns and help create a more compelling story. (Alberto Cairo talks about the importance of enumeration unit size in this week’s reading assignment.)

Another criticism is that the map only uses a fraction of the employment/unemployment information available from the BLS. This data is relatively easy to download, so there’s no real reason not to include a richer dataset in the graphic. Additional data would allow more detailed monthly trends and more meaningful comparisons to the national rate and/or the rates of other states.

Finally, I think the color scheme used on this map is hard to interpret. The color categories are not easily distinguished from one another and they don’t relate to any natural scale that the user could use to detect patterns. Creating more categories might also help with interpretation of the data.

The range and structure of the data suggests that there is a good story to be found looking at unemployment before, during and after Obama’s first term. There were certainly some unusual statistics associated with the 2007-2009 recession (as defined by the National Bureau of Economic Research).  It was the worst period of economic performance in the U.S. since the Great Depression and the pace of the recovery is one of the slowest on record.

In fact, until President Obama was re-elected a few weeks ago, no sitting president since World War II had been returned to office with an unemployment rate above 7.2%. This metric was such a sacred cow that conservative pundits accused the BLS of bias when data more favorable to the President was released in the run-up to the election. So, how did Obama earn a second term fighting these headwinds?

My first set of charts presents an overview of unemployment in the U.S. over the past twelve years. I wanted to show both the long-term trend in unemployment as well as a side-by-side comparison of the three most recent presidential terms. On the first chart, I’ve included a shaded area for each of the past two recessions to show their effect on the trend.
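
(For anyone who wants to rough out this kind of chart in code rather than a BI tool, here is a bare-bones matplotlib sketch of the recession-shading idea. The unemployment series below is a flat dummy placeholder, the real numbers come from the BLS, and the shaded spans are the NBER’s official recession dates.)

```python
# A bare-bones sketch of a monthly unemployment line with recessions shaded.
from datetime import datetime
import matplotlib.pyplot as plt

months = [datetime(year, month, 1)
          for year in range(2001, 2013) for month in range(1, 13)]
unemployment = [5.0] * len(months)  # placeholder: substitute the monthly BLS rate

recessions = [(datetime(2001, 3, 1), datetime(2001, 11, 1)),   # 2001 recession (NBER)
              (datetime(2007, 12, 1), datetime(2009, 6, 1))]   # Great Recession (NBER)

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(months, unemployment, color="steelblue")
for start, end in recessions:
    ax.axvspan(start, end, color="gray", alpha=0.3)  # shade each recession
ax.set_ylabel("Unemployment rate (%)")
ax.set_title("U.S. unemployment rate, 2001-2012 (recessions shaded)")
plt.tight_layout()
plt.show()
```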

The first thing I noticed by looking at these charts is that, over the past twelve years, the U.S. unemployment rate has never been lower than it was during George W. Bush’s first month in office. The rate got pretty close to that mark in the final months of Bush’s second term but it never quite made it. The second thing I noticed was that the drop in unemployment during the months following the Great Recession was slightly faster than it was during the recovery period following the 2001 recession.

My second chart shows the unemployment rate for each state over the course of Obama’s first term. It also includes a ranking of states by total unemployment and colors each chart using the results of the 2012 election.

A Thanksgiving Meal Preparation Timeline

The art of timing the preparation of Thanksgiving dishes takes years of experience and perhaps more than a few hard lessons in the kitchen (ever have anyone de-bone a turkey?). For less experienced chefs, I’ve always felt that a good infographic might help organize the work so that all the dishes are ready at the proper time.

I didn’t have the time to document my own family’s meal this year but I noticed that L.V. Anderson over at Slate wrote a great piece on her attempt to organize a full dinner. She sums up the issues nicely:

Cooking a Thanksgiving meal is a somewhat masochistic enterprise. It’s rewarding, for sure, and fun if you like cooking. But perfectly coordinating the timing of several dishes—nearly all of which taste best hot, many of which require oven time, and some of which begin deteriorating in quality shortly after you finish cooking them—is, well, impossible.

I’ve taken her instructions and organized them into a timeline with a target mealtime of 3:00 PM. Each box in the chart represents a 15-minute interval and clicking on it describes the task and provides a link to the recipe. Here it is … posted just under the wire:
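
(For the curious, a static version of this kind of timeline is easy to rough out in matplotlib; here is a quick sketch. The dishes and times below are placeholders I made up for illustration, not Anderson’s actual schedule.)

```python
# A quick static sketch of the timeline idea: one bar per dish, target mealtime marked.
import matplotlib.pyplot as plt

# (dish, start hour, end hour) in 24-hour time; purely illustrative values.
tasks = [
    ("Turkey",          9.0,  14.5),
    ("Stuffing",        11.0, 13.0),
    ("Mashed potatoes", 13.5, 14.75),
]

fig, ax = plt.subplots(figsize=(10, 3))
for i, (dish, start, end) in enumerate(tasks):
    ax.barh(i, end - start, left=start, height=0.6)  # one bar per dish
ax.set_yticks(range(len(tasks)))
ax.set_yticklabels([dish for dish, _, _ in tasks])
ax.axvline(15.0, color="red", linestyle="--")        # target mealtime: 3:00 PM
ax.set_xticks(range(9, 16))
ax.set_xticklabels([f"{(h - 1) % 12 + 1}:00" for h in range(9, 16)])
ax.set_xlabel("Time of day")
plt.tight_layout()
plt.show()
```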

It still needs some work so I’ll be making a few changes over the weekend. Meanwhile, Happy Thanksgiving!

Infographics and Data Visualization (Week 3)

The goal of this week’s assignment is to review some global aid data from the Guardian and evaluate how this information should be presented.  This is a two-part assignment and I have been able to download the data and let my thoughts percolate over the past few days. The focus is on the aid transparency index, which uses a broad set of criteria to rank major aid donors on their openness.

I’ll have to admit that my first reaction after looking at the data a bit was a muted “so what?” A simple rank of the aid organizations shows some of the usual good samaritans at the top and an apparent decline in transparency that roughly corresponds to a drop in GDP per capita (or possibly happiness or density of heavy metal bands).

Part of my lukewarm response stems from the fact that I don’t really know what the consequences of transparency (or a lack of it) are. Is there a concern about influence? Bribery? Funding of criminal or terrorist organizations? The U.S. aid organizations are kind of in the middle of the pack, which I suppose is not ideal. However, the U.S. list includes the Department of Defense, which I wouldn’t necessarily expect to be that open given the particular nature of its mission.

Other questions that come to mind include:

  • What criteria are used to pick the organizations in this list? Who’s missing?
  • Do other military organizations make the list?
  • How is aid defined?
  • Why are some countries’ scores aggregated while others are listed separately by organization?

Some of these answers can be found in the primary report, which suggests that the goal of aid transparency is to allow for effective policy planning and decision-making. The report states:

For aid to be more effective it needs to be more predictable, coordinated between donors, managed for results, and aligned to recipient countries’ own plans and systems. To achieve this, the information has to be shared between all parties involved in the delivery of aid in a timely, comprehensive and comparable way. Without this information it is not possible to know what is being spent where, by whom and with what results.

This makes sense … but I don’t know if I would normally associate this goal with “transparency.” To me, transparency has more to do with promoting accountability and providing information to citizens about what their Government is doing. The aid Index seems to be more about project coordination, efficiency and data governance. (Later on in the report, the text does mention that citizens will want to know where their money is going … more of a traditional goal of transparency.)

One of the major tools in the push for transparency is the development of a common standard for publishing aid information through the International Aid Transparency Initiative (IATI). The IATI registry has improved the quality and transparency of aid information, particularly for organizations that have either automated their publication or have already begun to address gaps and inconsistencies.

So, is there a story in the development and adoption of this standard? The report itself suggests that the purpose of the Index is in flux and asks whether a simpler methodology could still achieve the goal of providing effective, efficient and accountable aid information.

As I thought about this chart, I decided that any overview should show both the total transparency score and some measure of improvement from the previous year (there is both a 2011 and 2012 score). I decided on a scatterplot with the total score on the horizontal axis and the change in score (a ratio or percent) from 2011 to 2012 on the vertical axis. Along the right side I also thought I’d include a regular bar chart sorted by score.

A static sketch of this first chart:

I like the way the scatterplot emphasizes both the overall score and the year-over-year improvement. This shows organizations that have made progress toward the ultimate goal of transparency but may not have reached the heights of a group like the World Bank. The bar chart on the right shows the standard ranking.
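
(For anyone who wants to prototype this layout in code, here is a minimal matplotlib/pandas sketch of the scatter-plus-ranking idea. The donor names and scores below are made-up placeholders, not the actual Index data.)

```python
# Minimal sketch of the proposed layout: scatter of score vs. change on the left,
# standard ranking as a horizontal bar chart on the right.
import matplotlib.pyplot as plt
import pandas as pd

# Made-up placeholder scores; substitute the real 2011/2012 Index data.
scores = pd.DataFrame({
    "donor": ["World Bank", "Donor B", "Donor C", "Donor D"],
    "score_2011": [80, 45, 30, 55],
    "score_2012": [85, 60, 28, 58],
})
scores["change"] = scores["score_2012"] - scores["score_2011"]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(11, 4))

# Left panel: overall 2012 score vs. year-over-year change.
ax1.scatter(scores["score_2012"], scores["change"])
for _, row in scores.iterrows():
    ax1.annotate(row["donor"], (row["score_2012"], row["change"]))
ax1.set_xlabel("2012 transparency score")
ax1.set_ylabel("Change from 2011")

# Right panel: the standard ranking.
ranked = scores.sort_values("score_2012")
ax2.barh(ranked["donor"].tolist(), ranked["score_2012"].tolist())
ax2.set_xlabel("2012 transparency score")

plt.tight_layout()
plt.show()
```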

From this chart, the user should be able to navigate to details for each agency. I’d like to see comparisons of each sub-level (agency, organization, country) as well as the individual survey questions. There’s a pretty interesting chart toward the end of the report that shows the responses to all questions for all agencies as colored dots. It is intriguing and might offer some direction for these detailed charts. Otherwise, it may be worth exploring standard charts.