A former co-worker of mine always used to joke about our company’s customer database by posing the deceptively simple question: “How many ways can you spell ‘IBM’?” In fact, the number of unique entries for that particular client was in the dozens. Here is a sample of possible variations, with abbreviations alone accounting for several of them:
- I B M
- I. B. M.
- IBM CORP
- IBM CORPORATION
- INTL BUS MACHINES
- INTERNATION BUSINESS MACHINES
- INTERNATIONAL BUSINESS MACHINES
- INTERNATIONAL BUSINESS MA
I thought of this anecdote recently while I was reading an article about the government’s Terrorist Identities Datamart Environment list (TIDE), an attempt to consolidate the terrorist watch lists of various intelligence organizations (CIA, FBI, NSA, etc.) into a single, centralized database. TIDE was coming under scrutiny because it had failed to flag Tamerlan Tsarnaev (the elder suspect in the Boston Marathon bombings) as a threat when he re-entered the country in July 2012 after a six-month trip to Russia. It turns out that Tsarnaev’s TIDE entry didn’t register with U.S. customs officials because his name was misspelled and his date of birth was incorrect.
These types of data entry errors are incredibly common. I keep a running list of direct marketers’ misspellings of my own last name and it currently stands at 22 variations. In the data world, these variations can be described by their “edit distance,” or Levenshtein distance: the number of single-character substitutions, deletions, or insertions required to correct the entry.
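For readers who want to see how an edit distance is counted, here is a minimal Python sketch of the textbook dynamic-programming approach. The function name `levenshtein` is my own label for illustration, not something taken from any particular database product.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits that turn a into b."""
    # previous[j] = distance between the prefix of `a` processed so far
    # and the first j characters of `b`. Start with the empty prefix of `a`.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning i characters of `a` into "" takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


# Two of the customer-database entries listed earlier differ by two insertions ("AL").
print(levenshtein("INTERNATION BUSINESS MACHINES", "INTERNATIONAL BUSINESS MACHINES"))  # 2
```

A small distance suggests a typo; a large one, as with "IBM CORP" versus "IBM CORPORATION", may simply be an abbreviation.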
| Actual Name | Phonetic Misspellings | Dropped Letters | Inserted Letters | Converted Letters |
Many of these typographical mistakes are the result of my own poor handwriting, which I admit can be difficult to transcribe. However, if marketers have this much trouble with a basic, five-letter last name, you can imagine the problems the feds might have with a longer foreign name with extra vowels, umlauts, accents, and other flourishes thrown in for good measure. Add in a first name and a middle initial and the list of possible permutations grows quite large … and this doesn’t even begin to address the issue of people with the same or similar names. (My own sister gets pulled out of airport security lines on a regular basis because her name doppelgänger has caught the attention of the feds.)
The standard solutions for these types of problems typically involve techniques like fuzzy matching algorithms and other programmatic methods for eliminating duplicates and automatically merging associated records. The problem with this approach is that it either ignores or downplays the human element in developing and maintaining such databases.
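As a rough illustration of what fuzzy matching looks like in practice, the sketch below uses Python's standard `difflib` module. The helper names (`normalize`, `similarity`, `is_probable_duplicate`) and the 0.9 threshold are assumptions chosen for this example; real deduplication tools use more elaborate rules.

```python
import re
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Uppercase the name and drop everything except letters and digits."""
    return re.sub(r"[^A-Z0-9]", "", name.upper())


def similarity(a: str, b: str) -> float:
    """Return a 0.0-1.0 similarity score between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_probable_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    """Flag two entries as likely duplicates (the threshold is an arbitrary assumption)."""
    return similarity(a, b) >= threshold


# Punctuation and spacing differences disappear after normalization.
print(is_probable_duplicate("I B M", "I. B. M."))
# A dropped syllable still scores above the threshold.
print(is_probable_duplicate("INTERNATION BUSINESS MACHINES", "INTERNATIONAL BUSINESS MACHINES"))
# An abbreviation and its expansion do not look alike to a string comparison.
print(is_probable_duplicate("IBM", "INTERNATIONAL BUSINESS MACHINES"))
```

The last case is the telling one: character-level similarity cannot tell that "IBM" and "INTERNATIONAL BUSINESS MACHINES" are the same customer, so someone still has to supply that knowledge.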
My personal experience suggests that most people view data and databases as an advanced technological domain that is the exclusive purview of programmers, developers, and other IT professionals. In reality, the “high tech” aspect of data is limited to its storage and manipulation. The actual content of databases — the data itself — is most decidedly low tech … words and numbers. By focusing popular attention almost exclusively on the machinery and software involved in data processing, we miss the points in the data life-cycle where most errors start to creep in: the people who enter information and the people who interpret it.
I once worked at a company where we introduced a crude quality check to a manual double-entry process. If two pieces of information didn’t match, the program paused to let the person correct their mistake. The data entry folk were incensed! The automatic checks were bogging down the process and hurting their productivity. Never mind that the quality of the data had improved … what really mattered was speed!
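A check of this kind can be very small. The following Python sketch is not the program we actually used; it is a hypothetical reconstruction in which each pass of keyed-in data is a dictionary, and the field names and values are invented for illustration.

```python
def double_entry_mismatches(first_pass: dict, second_pass: dict) -> list:
    """Return the names of fields whose two keyed-in values disagree."""
    # Consider every field seen in either pass, so a field skipped in one
    # pass is reported as a mismatch too.
    fields = sorted(set(first_pass) | set(second_pass))
    return [f for f in fields if first_pass.get(f) != second_pass.get(f)]


# Hypothetical record typed twice; the second pass transposes two digits.
first = {"name": "IBM CORP", "zip": "10504"}
second = {"name": "IBM CORP", "zip": "10540"}

mismatches = double_entry_mismatches(first, second)
if mismatches:
    # This is the point where the entry screen would pause for a correction.
    print("Please re-check:", ", ".join(mismatches))
```

Every pause costs the operator a few seconds, which is exactly the trade-off between speed and quality that caused the friction.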
On the other hand, I’ve also seen situations where perfectly capable people had difficulty pulling basic trends from their Business Intelligence (BI) software. The reporting deployments were so intimidating that people would often end up moving their data over to a copy of Microsoft Excel so they could work with a more familiar tool.
In both cases, the problem wasn’t the technology per se, but the way in which humans interacted with the technology. People make mistakes and take shortcuts … it is a natural part of our creativity and self-expression. We’re just not cut out to follow the exacting standards of some of these computerized environments.
In the case of databases like TIDE, as long as the focus remains on finding technical solutions to data problems, we miss out on what I think is the real opportunity — human solutions that focus on usability, making intuitive connections, and the ease of interpretation.
- July 7, 2013 – In a similar database failure, Interpol refused to issue a worldwide “Red Notice” for Edward Snowden recently because the U.S. paperwork didn’t include his passport number and listed his middle name incorrectly.
- January 2, 2014 – For a great article on fuzzy matching, check out the following: http://marketing.profisee.com/acton/attachment/2329/f-015e/1/-/-/-/-/file.pdf.