Tag Archives: Data

Politicians Discover Data Science

During the 2008 U.S. Presidential campaign, the online design community devoted a lot of pixels to comparisons of the two candidates’ web sites (a few great examples here, here, and here). The overall consensus was that Obama won the war for eyeballs by emphasizing design, web usability, multimedia, and robust social networking. According to an in-depth study by the Pew Research Center’s Project for Excellence in Journalism, Obama’s online network was over five times larger than McCain’s by election day and his site was drawing almost three times as many unique visitors each week.

There is no doubt that the web has fundamentally transformed the way political campaigns are run. Voters are no longer tied to traditional media outlets for information and they can participate directly in a campaign in ways that were unimaginable only a few years ago. Adam Nagourney, columnist for the New York Times, summed it up nicely:

[The Internet has] rewritten the rules on how to reach voters, raise money, organize supporters, manage the news media, track and mold public opinion, and wage — and withstand — political attacks.

So, with the next campaign season gearing up, what technology-driven changes can we expect for 2012? If the rumblings are true, this election may see the ascendancy of data science as a formal part of the campaign toolkit.

In a recent CNN article, Micah Sifry wrote about the Obama campaign’s establishment of a “multi-disciplinary team of statisticians, predictive modelers, data mining experts, mathematicians, software developers, general analysts and organizers.” The article goes on to discuss the importance of data harmonization (a fancy term for master data management), geo-targeting, and integrated marketing.

Obama may be struggling in the polls and even losing support among his core boosters, but when it comes to the modern mechanics of identifying, connecting with and mobilizing voters, as well as the challenge of integrating voter information with the complex internal workings of a national campaign, his team is way ahead of the Republican pack.

All this has some GOP supporters concerned. Martin Avila, a Republican technology consultant, states in the same article that he doesn’t think that anyone on the opposing side fully understands the power of organizing and analyzing all of this data. According to Avila, the current GOP use of information technology is still largely shaped by its pre-Internet experience in broadcast advertising.

In some ways, this cavalier attitude toward the value of data shouldn’t come as a complete surprise. One trait that many members of the so-called “party of business” share with executives in the private sector is a strong attachment to a “gut based” approach to making decisions.

A recent Accenture Analytics survey of over 600 managers at more than 500 companies found that senior managers rarely used data-driven analysis when making key business decisions and instead relied heavily on intuition, peer-to-peer consultation, and other soft factors. According to the study, 50% of companies weren’t even structured in a way that would allow them to use data and analytical talent to generate enterprise-wide insight. In addition, those organizations that did make analytics-based decisions often depended on inconsistent, inaccurate, or incomplete data.

Savvy voters, like savvy customers, have come to expect a certain level of performance and consistency from the IT systems they use. This is bad news for businesses that still think that things like social media, data analytics, and master data management are gimmicks:

Organizations that fail to tackle the issues around data, technology and analytics talent will lose out to the high-performing 10 percent who have leveraged predictive analytics to become more agile and gain competitive advantage.

Creating a structured program for better targeting and more efficient communications seems like a no-brainer these days, but, for now, there doesn’t seem to be a lot of competition.

Further Reading:

    • 1/30/2012 – Slate recently published an article that talks about the different philosophies guiding the development of Democratic and Republican voter databases. Catalist, an independent data initiative, is focused less on profit and more on becoming “an indispensable tactical resource for the American left” with a privately-funded data warehouse containing records of the entire voting-age population combined with other commercially available data. Its customers include many traditionally liberal groups who consider the Democratic National Committee’s database insufficient. In response, the DNC has stepped up development of its own database, the Voting List Management Cooperative (or “Co-op”). In order to take advantage of the increased desire for voter information, the DNC has also developed statistical models that are particularly valuable for candidates. Meanwhile, the Republican National Committee established the Data Trust, a private company filled to the brim with former RNC staffers and committee members. The goal of this organization is to create robust voter profiles that can be shared with political allies. However, because of concerns about outside influence, the RNC is modeling it more along the lines of the DNC’s data co-operative instead of the more independent Catalist. The Data Trust development model is also less focused on data mining activities and more on basic data.
    • 7/17/2012 – Another Slate article. This one covers the Romney campaign’s attempt to boost its analytics efforts. Their initial approach appears to center on trying to figure out the President’s strategy by tracking his movements and breaking down his ad buys. This seems pretty reactive to me but time will tell.

    Six Degrees of Joy Division

    My local record store used to have this great poster on the back wall that explained how several dozen British indie bands from the 80s were all linked together through their various group members. The title of the poster was something like “Why All These Bands Sound the Same” and it was clearly a tongue-in-cheek slam of the gloomy post-punk sound of musical groups like Bauhaus and the Smiths.

    I loved the design concept and looked for the poster when the store finally went out of business a few years ago. Although I never found it, it occurred to me recently that I might be able to reconstruct the graphic using some modern tools and data from the online music site AllMusic.com.

    AllMusic is an outstanding musical resource and their meticulous site formatting allowed me to write a program that would crawl from page to page gathering information about interrelated bands and band members as it went. I decided to use the group Joy Division as a starting point because I liked the movie Control and had a vague memory of that particular band name appearing on the poster. The program ran overnight … evaluating 37,538 separate pages before it completed its run.

    Using Many Eyes, IBM’s visualization tool, I created a network diagram of the bands that are within six steps of my “seed” group. The full interactive results are at the end of the post (worth the effort if the Many Eyes site is working) but here is a detail:


    The Joy Division Network

    At nearly 38K records, this particular musical network covers a huge swath of Anglo-American rock-and-roll and includes almost all of the major groups in the Pop/Rock genre. What’s perhaps most interesting about this massive network is the fact that Joy Division is only linked to two bands directly, the acclaimed New Order (formed in 1980 after the death of JD vocalist Ian Curtis) and the Manchester supergroup Freebass (formed in 2004). All other connections are indirect, with a total of 20 degrees of separation between Joy Division and the most distant band in the network, post-grunge Los Angeles outfit Open Hand (formed in 2000).
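    The degrees-of-separation figures above fall out of a standard breadth-first search over the band graph. Here is a minimal Python sketch of that computation, using a toy adjacency list with a few of the bands named in this post (the real network had roughly 38,000 records, and the Electronic link shown here is just for illustration):

```python
from collections import deque

def degrees_from(seed, graph):
    """Breadth-first search: return each reachable band's
    distance (in hops) from the seed band."""
    dist = {seed: 0}
    queue = deque([seed])
    while queue:
        band = queue.popleft()
        for neighbor in graph.get(band, []):
            if neighbor not in dist:          # first visit = shortest path
                dist[neighbor] = dist[band] + 1
                queue.append(neighbor)
    return dist

# Toy adjacency list; the real graph came from the AllMusic crawl.
graph = {
    "Joy Division": ["New Order", "Freebass"],
    "New Order": ["Joy Division", "Freebass", "Electronic"],
    "Freebass": ["Joy Division", "New Order"],
    "Electronic": ["New Order"],
}
print(degrees_from("Joy Division", graph))
```

    Filtering the resulting distances to values of six or less gives the “six degrees of Joy Division” subset that went into the network diagram.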


    Other Thoughts on the Data

    The first odd thing I noticed about the network was that, by focusing on the relationships between bands, the network excludes a lot of well-known solo artists. Even when these musicians joined a band, their independent careers limited these associations to one or two instances. The best example of this situation would be someone like Elvis Presley or Johnny Cash. Both of these artists were loosely linked together through a glorified hootenanny called The Million Dollar Quartet (along with Carl Perkins and Jerry Lee Lewis). The only other bands in this network are The Offenders and the Cash-related groups The Highwaymen and Johnny Cash & the Tennessee Two. Some of the other solo artists in this minor network are household names (depending on the household, of course), including Waylon Jennings, Kris Kristofferson, and Willie Nelson. Three bands, a half-dozen stars and a lot of hits … but no direct connection to the huge Joy Division network. Many current rap artists seem to fit this mold as well.

    On the flip side, progressive rock groups like King Crimson had members who were in dozens of other bands. These social connectors can be seen at the center of a huge spider web of interrelated groups in the network diagram. Bands like these are often experimental in nature, with talented musicians putting their stamp on a number of different side projects. Some very influential artists can be spotted in the midst of these groups, including — using King Crimson as an example — famous journeyman players like Robert Fripp, Adrian Belew, John Wetton and Greg Lake.

    Finally, although I distinctly remember the band Bauhaus and its associated constellation of bands (Love & Rockets, Tones on Tail, The Jazz Butcher, etc.) on the poster, they were not within six degrees of separation of Joy Division in the network data (they were about eight links away). This exposes an issue with my data gathering methodology because it doesn’t take into account other relationships between artists such as mentors, guest musicians, common producers or other ties. Still, it was an interesting exercise with fruitful results.

    Additional Interactive Charts

    Bubble diagram of musical styles (full band network):
    Network diagram (six degrees of Joy Division):

    They’re Coming for Your Data

    Data privacy breaches seem to be the issue du jour for the tech sector. On October 18, 2010, it was revealed that several of the most popular Facebook applications had transmitted the personal information of tens of millions of users — including ID numbers, demographic data and names of friends — to various outside advertising and data companies. The next day, Canada’s privacy commissioner concluded that Google had violated that country’s privacy laws by harvesting personal information from unsecured wireless networks using its Street View system. In each case, it appears that the transmission of personal data violated the company’s own stated privacy policies. This means one of two things: either these two companies didn’t know what their own technology was doing, or they did know and are covering up that fact with denials and finger-pointing.

    For all appearances, Google seems to be on the side of the angels on this one. When they first learned of the issue back in April/May, the company immediately ceased data collection and notified the authorities of what had happened. I’m not so sure about Facebook, though. Recent history suggests that the social networking giant has shifted its stance on privacy to better support its business model, which depends on open sharing of user information. Back in December 2009, the company changed the default privacy settings of its software so that users had to opt out of the public availability of their information. Then, in an interview with Mike Arrington at the January 2010 Crunchies event, Facebook founder Mark Zuckerberg brushed aside privacy concerns by saying that these changes simply reflect the new “social norm” regarding the disclosure of personal information.

    The problem is that, instead of letting people decide for themselves what these new social norms should be, Facebook has made a unilateral decision to nudge their users toward a more open information environment. This is a pretty condescending approach. In reality, the company should just make it easier for any user to decide exactly how much personal information they want to share and with whom. The fact that Facebook can’t or won’t give their users the ability to fine-tune their own privacy settings tells me that the company is betting its future on people’s willingness to give up their right to privacy for the convenience of talking to people they haven’t seen since kindergarten.

    Further reading:

    The Etiquette of File Naming

    My company has a public network that is open to all employees and is used to store shared information. The number of folders is quite large and — in an effort to make the search for a particular folder easier — a few people have started adding exclamation points to the front of their file names. This causes their file to float to the top of an alphabetized list, making it show up front and center when you navigate to the top node of the file hierarchy. Now, is this clever or rude?

    Screen real estate is pretty important when it comes to web pages and mobile devices — I’ve heard many stories about departments or individuals who fought to have their information appear “above the fold” on a corporate site — but should this approach apply to an internal file network?

    One could argue that only critical files should take pole position: emergency procedures or frequently accessed company data. But what happens when Jim from Marketing is just too lazy to scroll through everyone else’s stuff and vaults his file to first place? Is it OK for Suzie in Finance to then add two exclamation points to the name of her file? Does Bob the CEO get to use three exclamation points? Where does it all end?

    The whole idea behind a system as venerable as alphabetical order is that we all know how to use it. By circumventing this system, are these people trampling on the rights of others just to gain an — albeit minor — advantage? Is this a trivial act of disobedience or is this more along the lines of cutting in line at the supermarket or parking in a handicapped space?
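    The trick works because most plain alphabetized listings sort names by character code, and the exclamation point (ASCII 33) comes before every digit and letter. A quick Python illustration (the file names here are invented):

```python
# '!' has a lower character code (33) than any digit or letter,
# so prefixed names jump to the top of a code-point sort.
names = [
    "Annual Report.doc",
    "!Jim - campaign ideas.doc",
    "Budget 2011.xls",
    "!!Suzie - Q3 numbers.xls",
]
for name in sorted(names):
    print(name)
```

    Two exclamation points beat one, so Suzie’s file lands above Jim’s — which is exactly the arms race described above.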


    (In an interesting experiment, I added 27 exclamation points and a few ampersands to my file name … it was deleted.)

    Relearning an Old Lesson

    Anyone who’s ever used a computer has — at some point — lost a carefully-crafted sequence of ones and zeros to the unforgiving gods of the digital realm. Every time it happens, you mourn, you rage against the sky, you re-write and you move on. Each time, you add another rule-of-thumb to the mental checklist designed to minimize the losses or at least ease the reconstruction effort of the next minor catastrophe. In the end, it all boils down to one simple rule. That’s right, ladies and gentlemen: save often.

    I was reminded of this basic lesson when I arrived at work on Monday and was presented with an odd little e-mail from the data warehouse team. Why had I changed the user profile for the data feed to the Marketing server? Hmm … I hadn’t actually done anything to the profile and, when I checked the database, it wasn’t just altered — it was completely missing. Not a big deal, I thought, because I could always reinstate the permissions from another source and then we’d be ready to continue with the upload process. Easy peasy.

    But upon closer inspection of the tables, it became clear that something was slightly off. First of all, the tables were old — not really old but still missing a few months of data. Then I checked a few structural updates I’d made the previous week and they weren’t there. Several new procedures were missing, too, as were a couple of new tables and even recently stored files on the shared network. Not good. It was becoming pretty apparent that the DBA team had done something major over the weekend and that everything on our server had been rolled back to July. Further investigation suggested that they had done their overhaul without a backup.

    It took two days of effort for everyone to finally accept that the information was just plain gone. I began the re-building effort rather reluctantly but it soon dawned on me that this whole incident was really a blessing in disguise. Our department had been working without a net for too long — our server was a creaky old SQL 2000 box leftover from a failed project and we’d never really had any official technical support. Now, we’ll probably be able to switch the whole thing over to a full-fledged production server with upgraded software, a routine maintenance plan and — best of all — a robust data backup procedure.

    In Search of the Right Metric

    The Wisconsin DNR just kicked off its PR campaign for Operation Deer Watch this week. This program is designed to collect data about the state’s deer herd and it is used primarily to keep tabs on the overall deer population. This is serious business for folks who live in America’s Dairyland and the DNR has created detailed modeling techniques to estimate things like population growth and fawn production for the various DMUs (Deer Management Units). It all sounds very interesting but what really caught my attention were some of the metrics that the DNR used and what these metrics meant for the management of the herd. For example:

    • Fawn-to-doe (“fawn recruitment”) ratios – This measure helps track the productivity of female deer and can be used to estimate the population growth of the herd. The ratios depend on many factors (e.g., predation, disease, nutrition) and typically hover around 0.8 — a ratio that suggests that fawn production was good and that fawn losses during spring/summer were low.
    • Buck-to-doe ratios – This measure looks at the proportion of males and females in the deer herd and is used to get an idea of the number of bucks available for harvest and an estimate of when females will reproduce. According to Texas Parks and Wildlife, a typical 1:5 buck-to-doe ratio — probably the biological maximum — results in a young herd that must be culled fairly dramatically to maintain a stable population. A 1:1 buck-to-doe ratio leads to an older herd with better antler potential but with a reduced annual harvest. (Such a 1:1 ratio is apparently difficult to achieve … most well-managed herds will have a little less than two adult does per adult buck.)
    • Buck age-structure – This measure is used to get a better understanding of the adult buck mortality rate. Age is one of the primary influences on the trophy status of bucks and the percent of yearling bucks in the population should be high enough to produce a good population of mature males.
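    As a concrete illustration, the two ratio metrics are simple to compute once you have counts. The numbers below are invented for the example — the DNR’s actual figures come out of statistical models rather than a single head count:

```python
def herd_ratios(bucks, does, fawns):
    """Return the fawn-to-doe and buck-to-doe ratios from raw survey counts."""
    return fawns / does, bucks / does

# Hypothetical survey: 120 bucks, 400 does, 320 fawns observed.
fawn_to_doe, buck_to_doe = herd_ratios(bucks=120, does=400, fawns=320)
print(f"fawn-to-doe: {fawn_to_doe:.2f}")  # 0.80 — the typical value cited above
print(f"buck-to-doe: {buck_to_doe:.2f}")  # 0.30 — i.e. roughly one buck per 3.3 does
```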

    What strikes me about these metrics is how specific they are and how well they fit in with the DNR’s overall strategy. In fact, the more I read about herd management and the DNR’s data collection program, the more it reminded me of a business trying to optimize its performance through a better understanding of the current state of the market. I would say that there are still plenty of companies out there who operate with a lot less precision.

    The difficulty that most businesses have with their metrics is that they lack the internal discipline to develop and maintain the performance indicators they need to make decisions. Major stumbling blocks include: lack of common definitions and terms; lack of good data; and difficulties interpreting results. When faced with some or all of these issues, the tendency is to focus on things that are easier to measure but less relevant.

    For the DNR, the metrics they use relate directly to policy decisions that can affect the quality of the deer herd — things like bag limits, habitat improvements, and the length and timing of the hunting season. Not everyone agrees with details of these policies, of course, but the metrics make it easier for everyone from local hunters to state politicians to determine the level of success or failure.

    Further reading:

    • For more on deer metrics and herd management goals, visit here and here.
    • For some thoughts on Quality Deer Management (QDM) programs, try this site.
    • For details of Wisconsin’s Sex-Age-Kill (SAK) model (and more deer-related data than you ever imagined), check out this paper.