How to Build the Perfect Data Science Team

Although the fields of statistics, data analysis, and computer programming have been around for decades, the use of the term “data science” to describe the intersection of these disciplines has only become popular within the last few years.

The rise of this new specialty — which the Data Science Association defines as “the scientific study of the creation, validation and transformation of data to create meaning” — has been accompanied by a number of heated debates, including discussions about its role in business, the validity of specific tools and techniques, and whether or not it should even be considered a science. For those convinced of its significance, however, the most important deliberations revolve around finding people with the right skills to do the job.

On one side of this debate are the purists who insist that data scientists are nothing more than statisticians with fancy new job titles. These folks worry that people are trying to horn in on a rather lucrative gig without getting the necessary statistics training. Their solution is simply to ignore the data science buzzword and hire a proper statistician.

At the other end of the spectrum are those who are convinced that making sense of large data sets requires more than number-crunching skills; it also requires the ability to manipulate the data and communicate insights to others. This view is perhaps best represented by Drew Conway’s data science Venn diagram and Mike Driscoll’s blog post on the three “sexy skills” of the data scientist. In Conway’s case, the components are computer programming (hacking), math and statistics, and specific domain expertise. For Driscoll, the key areas are statistics, data transformation — what he calls “data munging” — and data visualization.

The main problem with this multi-pronged approach is that finding a single individual with all of the right skills is nearly impossible. One solution to this dilemma is to create teams of two or three people that can collectively cover all of the necessary areas of expertise. However, this only leads to the next question, which is: What roles provide the best coverage?

In order to address this question, I decided to start with a more detailed definition of the process of finding meaning in data. In his PhD dissertation and later publication, Visualizing Data, Ben Fry broke down the process of understanding data into seven basic steps:

  1. Acquire – Find or obtain the data.
  2. Parse – Provide some structure or meaning to the data (e.g. ordering it into categories).
  3. Filter – Remove extraneous data and focus on key data elements.
  4. Mine – Use statistical methods or data mining techniques to find patterns or place the data in a mathematical context.
  5. Represent – Decide how to display the data effectively.
  6. Refine – Make the basic data representations clearer and more visually engaging.
  7. Interact – Add methods for manipulating the data so users can explore the results.

These steps can be roughly grouped into four broad areas: computer science (acquire and parse); mathematics, statistics, and data mining (filter and mine); graphic design (represent and refine); and information visualization and human-computer interaction (interact).
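To make the process concrete, here is a minimal sketch of the first four of Fry’s steps as a Python pipeline. The function names, the CSV data source, and the summary statistics are my own illustrative choices, not Fry’s:

```python
# A minimal sketch of the first four of Fry's steps as a data pipeline.
# The function names and the data source are hypothetical placeholders.
import csv
import statistics
import urllib.request

def acquire(url):
    """Acquire: find or obtain the raw data."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")

def parse(raw_text):
    """Parse: impose structure on the data (rows of named fields)."""
    return list(csv.DictReader(raw_text.splitlines()))

def filter_rows(rows, key):
    """Filter: drop extraneous rows, keeping the elements we care about."""
    return [row for row in rows if row.get(key)]

def mine(rows, key):
    """Mine: place the data in a simple mathematical context."""
    values = [float(row[key]) for row in rows]
    return {"mean": statistics.mean(values),
            "stdev": statistics.stdev(values)}

# Represent, Refine, and Interact would hand these summaries off to a
# charting or visualization layer (matplotlib, D3, and the like).
```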

In order to translate these skills into jobs, I started by selecting a set of occupations from the Occupational Information Network (O*NET) that I thought were strong in at least one or two of the areas in Ben Fry’s outline. I then evaluated a subset of skills and abilities for each of these occupations using the O*NET Content Model, which allows you to compare different jobs based on their key attributes and characteristics. I mapped several O*NET skills to each of Fry’s seven steps (details below).

O*NET Skills, Knowledge, and Abilities Associated with Ben Fry’s Seven Areas of Focus

Acquire (Computer Science)

  • Learning Strategies – Selecting and using training/instructional methods and procedures appropriate for the situation when learning or teaching new things.
  • Active Listening – Giving full attention to what other people are saying, taking time to understand the points being made, asking questions as appropriate, and not interrupting at inappropriate times.
  • Written Comprehension – The ability to read and understand information and ideas presented in writing.
  • Systems Evaluation – Identifying measures or indicators of system performance and the actions needed to improve or correct performance, relative to the goals of the system.
  • Selective Attention – The ability to concentrate on a task over a period of time without being distracted.
  • Memorization – The ability to remember information such as words, numbers, pictures, and procedures.
  • Oral Comprehension – The ability to listen to and understand information and ideas presented through spoken words and sentences.
  • Technology Design – Generating or adapting equipment and technology to serve user needs.

Parse (Computer Science)

  • Reading Comprehension – Understanding written sentences and paragraphs in work related documents.
  • Category Flexibility – The ability to generate or use different sets of rules for combining or grouping things in different ways.
  • Troubleshooting – Determining causes of operating errors and deciding what to do about it.
  • English Language – Knowledge of the structure and content of the English language including the meaning and spelling of words, rules of composition, and grammar.
  • Programming – Writing computer programs for various purposes.

Filter (Mathematics, Statistics, and Data Mining)

  • Flexibility of Closure – The ability to identify or detect a known pattern (a figure, object, word, or sound) that is hidden in other distracting material.
  • Judgment and Decision Making – Considering the relative costs and benefits of potential actions to choose the most appropriate one.
  • Critical Thinking – Using logic and reasoning to identify the strengths and weaknesses of alternative solutions, conclusions or approaches to problems.
  • Active Learning – Understanding the implications of new information for both current and future problem-solving and decision-making.
  • Problem Sensitivity – The ability to tell when something is wrong or is likely to go wrong. It does not involve solving the problem, only recognizing there is a problem.
  • Deductive Reasoning – The ability to apply general rules to specific problems to produce answers that make sense.
  • Perceptual Speed – The ability to quickly and accurately compare similarities and differences among sets of letters, numbers, objects, pictures, or patterns. The things to be compared may be presented at the same time or one after the other. This ability also includes comparing a presented object with a remembered object.

Mine (Mathematics, Statistics, and Data Mining)

  • Mathematical Reasoning – The ability to choose the right mathematical methods or formulas to solve a problem.
  • Complex Problem Solving – Identifying complex problems and reviewing related information to develop and evaluate options and implement solutions.
  • Mathematics (skill) – Using mathematics to solve problems.
  • Inductive Reasoning – The ability to combine pieces of information to form general rules or conclusions (includes finding a relationship among seemingly unrelated events).
  • Science – Using scientific rules and methods to solve problems.
  • Mathematics (knowledge) – Knowledge of arithmetic, algebra, geometry, calculus, statistics, and their applications.

Represent (Graphic Design)

  • Design – Knowledge of design techniques, tools, and principles involved in production of precision technical plans, blueprints, drawings, and models.
  • Visualization – The ability to imagine how something will look after it is moved around or when its parts are moved or rearranged.
  • Visual Color Discrimination – The ability to match or detect differences between colors, including shades of color and brightness.
  • Speed of Closure – The ability to quickly make sense of, combine, and organize information into meaningful patterns.

Refine (Graphic Design)

  • Fluency of Ideas – The ability to come up with a number of ideas about a topic (the number of ideas is important, not their quality, correctness, or creativity).
  • Information Ordering – The ability to arrange things or actions in a certain order or pattern according to a specific rule or set of rules (e.g., patterns of numbers, letters, words, pictures, mathematical operations).
  • Communications and Media – Knowledge of media production, communication, and dissemination techniques and methods. This includes alternative ways to inform and entertain via written, oral, and visual media.
  • Originality – The ability to come up with unusual or clever ideas about a given topic or situation, or to develop creative ways to solve a problem.

Interact (Information Visualization and Human-Computer Interaction)

  • Engineering and Technology – Knowledge of the practical application of engineering science and technology. This includes applying principles, techniques, procedures, and equipment to the design and production of various goods and services.
  • Education and Training – Knowledge of principles and methods for curriculum and training design, teaching and instruction for individuals and groups, and the measurement of training effects.
  • Operations Analysis – Analyzing needs and product requirements to create a design.
  • Psychology – Knowledge of human behavior and performance; individual differences in ability, personality, and interests; learning and motivation; psychological research methods; and the assessment and treatment of behavioral and affective disorders.

Using occupational scores for these individual O*NET skills and abilities, I was able to assign a weighted value to each of Ben Fry’s categories for several sample occupations. Visualizing these skills in a radar chart shows how different jobs (identified using standard SOC or O*NET codes) place different emphasis on the various skills. The three jobs below have strengths that could be cultivated and combined to meet the needs of a data science team.

A second example uses occupations that fall outside the usual sources of data science talent. You can see how — taken together — these non-traditional jobs combine to address each of Fry’s steps.
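For anyone who wants to reproduce this kind of chart, here is a rough sketch of the scoring and plotting steps. The skill-to-category mapping is abbreviated and the 0–100 ratings are invented for illustration; real values would come from the O*NET Content Model data files:

```python
# Sketch of rolling O*NET-style skill ratings up into Fry's categories
# and plotting one occupation on a radar chart. The mapping below is
# abbreviated and the ratings are invented for illustration only.
import math
import matplotlib.pyplot as plt

FRY_SKILLS = {
    "Acquire": ["Active Listening", "Written Comprehension"],
    "Parse": ["Reading Comprehension", "Programming"],
    "Filter": ["Critical Thinking", "Problem Sensitivity"],
    "Mine": ["Mathematical Reasoning", "Mathematics"],
    "Represent": ["Design", "Visualization"],
    "Refine": ["Information Ordering", "Originality"],
    "Interact": ["Operations Analysis", "Psychology"],
}

ratings = {  # hypothetical 0-100 ratings for one occupation
    "Active Listening": 55, "Written Comprehension": 70,
    "Reading Comprehension": 80, "Programming": 90,
    "Critical Thinking": 75, "Problem Sensitivity": 65,
    "Mathematical Reasoning": 85, "Mathematics": 80,
    "Design": 30, "Visualization": 40,
    "Information Ordering": 60, "Originality": 40,
    "Operations Analysis": 50, "Psychology": 25,
}

# Average each category's skill ratings to get one score per spoke.
scores = {cat: sum(ratings[s] for s in skills) / len(skills)
          for cat, skills in FRY_SKILLS.items()}

# Close the polygon by repeating the first point at the end.
labels = list(scores)
values = list(scores.values()) + [next(iter(scores.values()))]
angles = [i * 2 * math.pi / len(labels) for i in range(len(labels))]
angles += angles[:1]

ax = plt.subplot(polar=True)
ax.plot(angles, values)
ax.fill(angles, values, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(labels)
plt.show()
```

Averaging is the simplest possible weighting; O*NET publishes separate importance and level scales that could be combined in more sophisticated ways.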

According to a recent study by McKinsey, the U.S. “faces a shortage of 140,000 to 190,000 people with analytical expertise and 1.5 million managers and analysts with the skills to understand and make decisions” based on data. Instead of fighting over these scarce resources, companies would do well to think outside of the box and build their data science teams from unique individuals in other fields. While such teams may require additional training, they bring a set of skills to the table that can boost creativity and spark innovative thinking — just the sort of edge companies need when trying to pull meaning from their data.

Updates:

May 2, 2014 – The folks over at DarkHorse Analytics put together a list of the “five faces” of analytics. Great article.

  1. Data Steward – Manages the data and uses tools like SQL Server, MySQL, Oracle, and maybe some more rarefied tools.
  2. Analytic Explorer – Explores the data using math, statistics, and modeling.
  3. Information Artist – Organizes and presents data in order to sell the results of data exploration to decision-makers.
  4. Automator – Puts the work of the Explorer and the Information Artist into production.
  5. The Champion – Helps put all of the pieces in place to support an analytics environment.

Politicians Discover Data Science

During the 2008 U.S. Presidential campaign, the online design community devoted a lot of pixels to comparisons of the two candidates’ web sites (a few great examples here, here, and here). The overall consensus was that Obama won the war for eyeballs by emphasizing design, web usability, multimedia, and robust social networking. According to an in-depth study by the Pew Research Center’s Project for Excellence in Journalism, Obama’s online network was over five times larger than McCain’s by election day, and his site was drawing almost three times as many unique visitors each week.

There is no doubt that the web has fundamentally transformed the way political campaigns are run. Voters are no longer tied to traditional media outlets for information and they can participate directly in a campaign in ways that were unimaginable only a few years ago. Adam Nagourney, columnist for the New York Times, summed it up nicely:

[The Internet has] rewritten the rules on how to reach voters, raise money, organize supporters, manage the news media, track and mold public opinion, and wage — and withstand — political attacks.

So, with the next campaign season gearing up, what technology-driven changes can we expect for 2012? If the rumblings are true, this election may see the ascendancy of data science as a formal part of the campaign toolkit.

In a recent CNN article, Micah Sifry wrote about the Obama campaign’s establishment of a “multi-disciplinary team of statisticians, predictive modelers, data mining experts, mathematicians, software developers, general analysts and organizers.” The article goes on to discuss the importance of data harmonization (a fancy term for master data management), geo-targeting, and integrated marketing.
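Stripped of the jargon, harmonization just means deciding when records from different sources describe the same voter and merging what each source knows. Here is a toy sketch; the field names and the name-plus-ZIP matching rule are invented for illustration:

```python
# Toy sketch of data harmonization: deciding when records from two
# sources describe the same voter and merging what each source knows.
# Field names and the name-plus-ZIP matching rule are invented.
def normalize(name):
    """Flatten case and punctuation so formatting quirks don't block a match."""
    return " ".join(name.lower().replace(".", "").split())

def harmonize(source_a, source_b):
    """Merge records keyed on (normalized name, ZIP); A wins on conflicts."""
    merged = {(normalize(r["name"]), r["zip"]): dict(r) for r in source_a}
    for r in source_b:
        key = (normalize(r["name"]), r["zip"])
        record = merged.setdefault(key, {})
        for field, value in r.items():
            record.setdefault(field, value)  # only fill fields A lacked
    return list(merged.values())

donors = [{"name": "Jane Q. Public", "zip": "12345", "donated": "yes"}]
voters = [{"name": "jane q public", "zip": "12345", "voted_2008": "yes"}]
print(harmonize(donors, voters))
# -> one record carrying both donation and turnout history
```

Real campaigns layer far fuzzier matching (nicknames, moves, typos) on top of this idea, which is why the article treats harmonization as a discipline in its own right.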

Obama may be struggling in the polls and even losing support among his core boosters, but when it comes to the modern mechanics of identifying, connecting with and mobilizing voters, as well as the challenge of integrating voter information with the complex internal workings of a national campaign, his team is way ahead of the Republican pack.

All this has some GOP supporters concerned. Martin Avila, a Republican technology consultant, states in the same article that he doesn’t think that anyone on the opposing side fully understands the power of organizing and analyzing all of this data. According to Avila, the current GOP use of information technology is still largely shaped by its pre-Internet experience in broadcast advertising.

In some ways, this cavalier attitude toward the value of data shouldn’t come as a complete surprise. One trait that many members of the so-called “party of business” share with executives in the private sector is a strong attachment to a “gut based” approach to making decisions.

A recent Accenture Analytics survey of over 600 managers at more than 500 companies found that senior managers rarely used data-driven analysis when making key business decisions and instead relied heavily on intuition, peer-to-peer consultation, and other soft factors. According to the study, 50% of companies weren’t even structured in a way that would allow them to use data and analytical talent to generate enterprise-wide insight. In addition, those organizations that did make analytics-based decisions often depended on inconsistent, inaccurate, or incomplete data.

Savvy voters, like savvy customers, have come to expect a certain level of performance and consistency from the IT systems they use. This is bad news for businesses that still think that things like social media, data analytics, and master data management are gimmicks:

Organizations that fail to tackle the issues around data, technology and analytics talent will lose out to the high-performing 10 percent who have leveraged predictive analytics to become more agile and gain competitive advantage.

Creating a structured program for better targeting and more efficient communications seems like a no-brainer these days, but, for now, there doesn’t seem to be a lot of competition.

Further Reading:

    • 1/30/2012 – Slate recently published an article that talks about the different philosophies guiding the development of Democratic and Republican voter databases. Catalist, an independent data initiative, is focused less on profit and more on becoming “an indispensable tactical resource for the American left” with a privately-funded data warehouse containing records of the entire voting-age population combined with other commercially available data. Its customers include many traditionally liberal groups who consider the Democratic National Committee’s database insufficient. In response, the DNC has stepped up development of its own database, the Voting List Management Cooperative (or “Co-op”). In order to take advantage of the increased desire for voter information, the DNC has also developed statistical models that are particularly valuable for candidates. Meanwhile, the Republican National Committee established the Data Trust, a private company filled to the brim with former RNC staffers and committee members. The goal of this organization is to create robust voter profiles that can be shared with political allies. However, because of concerns about outside influence, the RNC is modeling it more along the lines of the DNC’s data co-operative than the more independent Catalist. The Data Trust development model is also less focused on data mining activities and more on basic data.
    • 7/17/2012 – Another Slate article. This one covers the Romney campaign’s attempt to boost its analytics efforts. Its initial approach appears to center on trying to figure out the President’s strategy by tracking his movements and breaking down his ad buys. This seems pretty reactive to me, but time will tell.