Tag Archives: Data Visualization

Infographics and Data Visualization (Sign Up)

Despite a crazy schedule, I’ve decided to sign up for a free online course offered by the Knight Center for Journalism called Introduction to Infographics and Data Visualization. It runs from October 28 to December 8 and will be taught by Alberto Cairo, the author of The Functional Art: An Introduction to Information Graphics and Visualization, published by Peachpit Press. I will be sure to post my completed assignments here. It should be fun!

You Are What You Watch

Experian-Simmons released some survey data in December that looked at the relative popularity of major television shows for three different political groups: liberal Democrats; conservative Republicans; and middle-of-the-road voters. Each show was given an index based on the concentration of specific voters and this information was used to create lists of the top programs for each political group in both entertainment and news categories.

Although these top ten lists were interesting on their own, the fact that each individual TV program actually had an index rating for all three groups offers an opportunity for more complex analysis. The most obvious next step involves comparing pairs of groups in a 2D scatterplot chart. The Tableau visualization below shows the results.

A few notes:

  • Entertainment shows are in blue, news shows are in orange.
  • Shows without enough data for a particular group were still plotted as a zero index.
  • Hovering over each data point reveals the show and its indices.

 

The first thing I noticed was that news shows were much more partisan than entertainment shows. In fact, almost all of the shows with the most extreme scores were either news shows (primarily FOX and MSNBC) or fake news shows (Comedy Central’s Daily Show and Colbert Report). PBS gets a few high scores on the liberal side but the standard television networks are all fairly evenly watched.

Another thing that strikes me is how similar the watching habits of middle-of-the-road voters are to those of conservative Republicans. The only noticeable exception occurs with news programs, but it is a pretty big exception: FOX News. All of the top ten conservative news programs were on FOX while none of the top middle-of-the-road news programs were on that network. It might be encouraging for conservative politicians to see the similarities in entertainment interests between conservative voters and independents but I suspect that the gulf in news sources would be hard to overcome.

Many of the other differences have been noted elsewhere but are worth repeating: liberal Democrats tend to favor funnier shows and stories involving morally complex characters while conservative Republicans favor shows where people are doing stuff — either real work or reality competitions.

Of course, having complained about the lack of 2D analysis for this data in the major online outlets, I would be remiss if I didn’t point out the fact that each show has three indices apiece. Logically, we should be trying to show the data in a 3D scatterplot.

This isn’t as easy as it sounds since most of the major charting applications aren’t very good in 3D and they don’t provide any interactive option for the web that I could find. The best options seemed to be R or something called CanvasXpress — neither of which I had worked with before. I chose R, which allowed me to create both static and interactive 3D plots. However, only screenshots of the interactive plot are available at the moment. Several hours later …

Much Ado About Coughin’

Whether you know it as whooping cough or the 100 days’ cough, pertussis — a bacterial infection that causes severe coughing fits — is no fun. According to Wikipedia, it affects nearly 50 million people annually and causes almost 300,000 deaths worldwide. Although most of these deaths occur in developing nations, pertussis is the only vaccine-preventable disease that is associated with increasing deaths in the U.S.

Pertussis can be particularly dangerous for young children, so health departments keep a pretty close eye on local outbreaks and ask parents to keep their kids home from school while undergoing treatment. Unfortunately, the infection is very contagious and early symptoms are pretty mild. Combine this with some parental fears surrounding the vaccine and you’ve got a pretty good recipe for the occasional quasi-epidemic.

This year’s “winner” in the whooping cough stakes is apparently Wisconsin. As of April 21, 2012, the CDC estimates that the Badger State has had over 1,000 cases of pertussis, which is about as many cases as all of the Pacific Coastal states combined. Among these unlucky cheeseheads were the two fully-vaccinated kids that currently live under my roof. (My wife speculates that they picked it up at an extremely packed showing of The Hunger Games.)

Now that the quarantine period is over and my two little data points are on the mend, I thought it would be interesting to use some of the CDC data to experiment with Google Charts. I was especially interested to note that Google had a treemap feature. In the chart below, the size of the rectangles represents the current number of whooping cough cases, while the colors represent the increase or decrease over the same period in 2011. (Note: in the revised treemap option, the size of the rectangles represents the current number of whooping cough cases per million in population.)

Pretty simple example, no drill downs or tooltips for now.

U.S. Cases of Whooping Cough (April 21, 2012)




Oh, and if you’re looking for Minnesota or Oklahoma, neither state has any current cases.
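The per-million option in the treemap comes down to a simple normalization. Here is a minimal sketch of that arithmetic, assuming each state has a case count and a population figure. The function name and the sample numbers are made up for illustration and are not CDC data.

```python
def cases_per_million(cases: int, population: int) -> float:
    """Scale a raw case count to cases per one million residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    return cases * 1_000_000 / population

# Illustrative, made-up numbers (not CDC figures).
sample = {
    "State A": (1_000, 5_000_000),
    "State B": (1_000, 20_000_000),
}

# Same raw count, very different rates once population is accounted for.
rates = {name: cases_per_million(c, p) for name, (c, p) in sample.items()}
```

Two states with the same raw count end up with very different rectangle sizes once population is taken into account, which is the point of the revised view.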

My favorite online example of a treemap is the Map of the Market on SmartMoney.com. The navigation is very robust and you can nest groups of categories on the primary display. Google’s product allows you to drill down several levels but I couldn’t figure out a way to combine them in one view. I also like the way SmartMoney’s chart allows you to display additional information about each element when you hover over it with your mouse. I suspect that this is possible with the Google version but it is not explicitly called out in the documentation.

Does it work? For comparison, here is the same data in a standard Google bar chart:

U.S. Cases of Whooping Cough (April 21, 2012)

The bar chart results in a lot of whitespace and it needs to be much bigger in order for all the bars to fit. I tried a bubble chart as well (below) but there are limitations for this format, too. In particular, clumps of bubbles are difficult to read. I had to transform the data using a logarithmic scale to spread the shapes out a bit.

U.S. Cases of Whooping Cough (April 21, 2012)

2012 Cases vs. Cases per Million (Size=Population)
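The logarithmic transform used to spread out the bubbles can be sketched roughly like this. The `+ 1` offset is my own assumption here so that states with zero cases stay defined; the post does not say how zeros were handled, and the counts are made up.

```python
import math

def log_transform(counts):
    """Map each count c to log10(c + 1) so zero counts stay defined."""
    return [math.log10(c + 1) for c in counts]

# Made-up counts spanning three orders of magnitude.
raw = [0, 9, 99, 999]

# After the transform the values are roughly evenly spaced
# instead of being clumped near the origin.
spread = log_transform(raw)
```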

Visualizing English Word Origins

I have been reading a book on the development of the English language recently and I’ve become fascinated with the idea of word etymology — the study of words and their origins. It’s no secret that English is a great borrower of foreign words but I’m not enough of an expert to really understand what that means for my day-to-day use of the language. Simply reading about word history didn’t help me, so I decided that I really needed to see some examples.

Using Douglas Harper’s online dictionary of etymology, I paired up words from various passages I found online with entries in the dictionary. For each word, I pulled out the first listed language of origin and then re-constructed the text with some additional HTML infrastructure. The HTML would allow me to associate each word (or word fragment) with a color, title, and hyperlink to a definition.

The results look like this:

The quick brown fox jumps over the lazy dog.

This simple sentence is constructed of eight distinct words and one word suffix. Six of the words are from Old English (colored in pink) while the others are from Gallo-Roman and Middle Low German (both colored in gray). Hovering over each word provides the exact source and clicking the word takes you to the full origin description.
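The markup behind a sentence like this can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the word-to-origin pairs are supplied by hand (as they were in practice), the palette follows the colors named above, and `BASE_URL` is a hypothetical placeholder rather than the dictionary's real link format.

```python
from html import escape
from urllib.parse import quote

# Palette named in the post: pink for Old English, gray for the others.
COLORS = {"Old English": "pink"}
DEFAULT_COLOR = "gray"

# Hypothetical placeholder; the real dictionary's URL scheme is not given here.
BASE_URL = "https://example.com/etymology?term="

def tag_word(word: str, origin: str) -> str:
    """Wrap one word in a link carrying its color, hover title, and definition URL."""
    color = COLORS.get(origin, DEFAULT_COLOR)
    return (
        f'<a href="{BASE_URL}{quote(word.lower())}" '
        f'title="{escape(origin)}" style="color: {color}">{escape(word)}</a>'
    )

def tag_sentence(pairs) -> str:
    """Join hand-matched (word, origin) pairs into one HTML string."""
    return " ".join(tag_word(w, o) for w, o in pairs)

html_out = tag_sentence([("quick", "Old English"), ("lazy", "Middle Low German")])
```

The `title` attribute gives the hover text and the `href` gives the click-through, which matches the two interactions described above.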

A second example shows more variety:

Supreme executive power derives from a mandate from the masses, not from some farcical aquatic ceremony.

This is a surprisingly complex Monty Python quote where the colors represent Old English (pink), Middle English (red), Anglo-French (orange), Old French (light orange), Middle French (pale orange), and Classical and Medieval Latin (both yellow). I suspect that both the complexity and variety of word sources is intentional — standing in humorous contrast to the appearance of the speaker.

What follows are five excerpts taken from a spectrum of written sources. The intent was to investigate each passage and see if word origin varied significantly based on the intended purpose of the passage.

(This process was pretty involved and my initial dream of creating an app that would allow me to convert any paragraph to this format faded when I realized that much of the word matching process needed manual intervention. I definitely suggest digging into the full etymology site to explore the full history of each word. I have probably made plenty of translation mistakes as I developed my paragraphs but I certainly had fun.)

Passage #1: American Literature

The first paragraph I looked at was an excerpt from Mark Twain’s The Adventures of Tom Sawyer. I chose this text because I thought it would have a good mix of English and American words.

Tom gave up the brush with reluctance in his face, but alacrity in his heart. And while the late steamer Big Missouri worked and sweated in the sun, the retired artist sat on a barrel in the shade close by, dangled his legs, munched his apple, and planned the slaughter of more innocents. There was no lack of material; boys happened along every little while; they came to jeer, but remained to whitewash. By the time Ben was fagged out, Tom had traded the next chance to Billy Fisher for a kite, in good repair; and when he played out, Johnny Miller bought in for a dead rat and a string to swing it with -- and so on, and so on, hour after hour. And when the middle of the afternoon came, from being a poor poverty-stricken boy in the morning, Tom was literally rolling in wealth. He had beside the things before mentioned, twelve marbles, part of a jews-harp, a piece of blue bottle-glass to look through, a spool cannon, a key that wouldn’t unlock anything, a fragment of chalk, a glass stopper of a decanter, a tin soldier, a couple of tadpoles, six fire-crackers, a kitten with only one eye, a brass door-knob, a dog-collar -- but no dog -- the handle of a knife, four pieces of orange-peel, and a dilapidated old window sash.

 

The passage has a solid base of Old English words mixed with a variety of French, Latin and Old Norse terms. Middle English makes an appearance in the form of a few words and suffixes while American English is found solely in the list of items Tom Sawyer collects from his friends. Two of these American terms (“fire-crackers” and “door-knob”) are hyphenated words built from Old English and Scandinavian components. (Several of Twain’s other hyphenated words apparently didn’t make it over the hump into full-fledged Americanisms. However, it should be noted that Twain was often the first author to record usage of U.S. slang of the era.)

I found it interesting that Middle English had such a poor showing in this text but it may be due to the fact that the defining elements of Middle English have more to do with sentence structure and grammatical elements than specific words. I was also surprised at the frequent use of longer, Latin-based words in an adventure novel, but the average word length comes in at about 4.4 characters — still fairly short and simple.

Although 73% of the word fragments are Old English, Twain uses words from over a dozen different sources in this short passage alone. Overall, the wide variety of word sources adds a pleasing “flavor” to the passage. The mix seems well-balanced and interesting.
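Figures like the average word length and the share of fragments per source are easy to recompute once a passage has been tagged. The sketch below shows one way to do it. The counting rules are my assumptions for illustration: words are runs of ASCII letters, so hyphenated words and contractions split into pieces, and the origin labels are supplied by hand. The exact rules behind the numbers quoted in these passages may differ.

```python
import re
from collections import Counter

def average_word_length(text: str) -> float:
    """Mean number of letters per word, ignoring punctuation and digits."""
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)

def origin_shares(origins):
    """Percentage of tagged fragments attributed to each origin."""
    counts = Counter(origins)
    total = sum(counts.values())
    return {name: 100 * n / total for name, n in counts.items()}

# Tiny made-up example: four fragments, three of them Old English.
avg = average_word_length("The quick brown fox")
shares = origin_shares(["Old English"] * 3 + ["Old French"])
```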

Passage #2: British Literature

For my second test, I wanted to look at text from a non-American author. I chose a paragraph from Charles Dickens’ Great Expectations out of respect for my 7th-grade English teacher.

My sister had a trenchant way of cutting our bread-and-butter for us, that never varied. First, with her left hand she jammed the loaf hard and fast against her bib where it sometimes got a pin into it, and sometimes a needle, which we afterwards got into our mouths. Then she took some butter (not too much) on a knife and spread it on the loaf, in an apothecary kind of way, as if she were making a plaister, using both sides of the knife with a slapping dexterity, and trimming and moulding the butter off round the crust. Then, she gave the knife a final smart wipe on the edge of the plaister, and then sawed a very thick round off the loaf: which she finally, before separating from the loaf, hewed into two halves, of which Joe got one, and I the other.

The relative simplicity of this passage surprised me a little. The average word length is about 4.2 and over 84% of the word fragments are basic Old English. No other source comes in over 5% and the variety of sources is half that of the Twain passage. Hebrew makes an appearance in the form of the name “Joe” but most of the other borrowed words are French in origin. Still, I found the text appealing in a way: basic words for a basic task.

Passage #3: Legal

The third paragraph comes from a United Nations document on maritime territories. I selected this passage because it seemed to contain more jargon and I suspected that much of this jargon was borrowed. This hunch proved to be correct.

Where the coasts of two States are opposite or adjacent to each other, neither of the two States is entitled, failing agreement between them to the contrary, to extend its territorial sea beyond the median line every point of which is equidistant from the nearest points on the baselines from which the breadth of the territorial seas of each of the two States is measured. The above provision does not apply, however, where it is necessary by reason of historic title or other special circumstances to delimit the territorial seas of the two States in a way which is at variance therewith.

This text had a much higher ratio of French and Latin word fragments (16.9% and 9.3%) and a longer average word length — nearly 4.8 characters — than both previous passages. With 64.4% of the word fragments, Old English still serves as a major binding agent in this text but there is less variety overall. Middle English makes its appearance only as a suffix and there is only one word outside of the English/French/Latin triumvirate. After the visual and poetic excitement of the two literature entries, this paragraph seems very bland.

Passage #4: Medicine
Note: This passage has been revised (see thread)

My dad suggested that I take a look at a healthcare-related passage to see if the use of specific medical terminology would tilt the word usage even farther away from “native” English words. Boy, was he right.

The anatomic axis of the lower extremity is defined by the femorotibial angle, which averages 5° of valgus; the mechanical axis of the lower extremity is defined by a plumb line connecting the center of the femoral head to the mid ankle on a standing anteroposterior weight-bearing radiograph. The mechanical axis averages 1.2° of varus, and it is more accurate than the anatomic axis in demonstrating load transmission across the knee joint, especially if femoral or tibial deformities contribute to limb malalignment. A study by Khan et al in patients with early symptomatic knee osteoarthritis showed a clear relationship between local knee alignment (as determined from short fluoroscopically guided standing anteroposterior knee radiographs) and the compartmental pattern and severity of knee osteoarthritis. In this study, each degree of increase in the local varus angle was associated with a significantly increased risk of having predominantly medial compartment osteoarthritis, and a similar association was found between the valgus angulation and lateral compartment osteoarthritis in 47 knees.

The medical paragraph has only 51.9% Old English word fragments and the average number of characters per word is 5.7, much higher than even the legal text. French, Latin, and Greek were used more frequently in this passage and, despite U.S. prowess in the healthcare field, there were no American English terms. This is a paragraph that is doing a lot of heavy lifting and it uses a lot of dense, muscular words to get the job done.

Passage #5: Sports

This last passage was an attempt to stack the deck in favor of some home grown words. It doesn’t get more American than baseball, but the only American word in this article about a spring training rainout between the Milwaukee Brewers and the Texas Rangers is the word “baseball” itself. Everything else is either Old English or borrowed. Still, I have to assume that phrases like “at-bats” and “suicide squeeze bunt” are not exactly common constructions and my guess is that the entire article would be a mystery to someone who didn’t know the game.

It was a wild, windy day at Maryvale Baseball Park before the rains came with the Brewers ahead, 6-4. The Brewers scored their runs on a throwing error, a delayed double steal, a wind-blown popup that fell in shallow center field, a fielding error on that same play, a wind-aided triple and a suicide squeeze bunt, all in three innings of at-bats.

The triple belonged to Caleb Gindl, who motored to third after Rangers center fielder Craig Gentry crashed into the wall, forcing open a large gate. Gentry and left fielder Conor Jackson worked together to close it so play could continue.

“It was crazy out there,” Gindl said. “It was scary in the outfield. After a while we were all just playing deep, knowing the ball would either get to us or blow out.”

Play didn’t last long after the Brewers’ four-run third inning. Brewers reliever Manny Parra pitched a scoreless fourth, then the grounds crew covered the field before the bottom of the inning could begin.

After a delay of just 12 minutes, the game was called.

First of all, I absolutely LOVE the fact that Caleb Gindl uses two Old Norse words to describe the weather conditions during the game. It provides a certain primal, unhinged quality to the situation and adds a third element — nature — to the contest. I also like the use of the onomatopoeic terms “pop” and “crash” because they serve to underscore the action.

The passage itself is a little lighter on the French and Latin roots than some of the earlier paragraphs and many of the terms are fairly short; the average word length comes in at about 4.6 characters. Some of this may be due to the fact that it is an online article (and attention spans are short) but it may also be related to the simple concepts at the core of the game itself. Words like “bat” and “ball” are very similar to their Proto-Indo-European roots (*bhat- and *bhel- respectively), suggesting that any associated activities are pretty basic to the language. Also, the sheer number and variety of numeric references (e.g. “three”, “third”, and “triple”) bring in many simple terms.

Geographic References in Local Business Names

This little exercise came about after I read an article on the old Northwest Territory in the U.S., which basically consisted of all the land west of Pennsylvania, northwest of the Ohio River, and east of the Mississippi River. As the country expanded westward, this geographic area gradually became known as the “Midwest” (or the East North Central States region) but not before the older name left its mark on the local culture. Organizations like Northwestern Mutual Life (Milwaukee) and Northwestern University (Chicago) still refer back to the days when these places were located on the fringe of the country, not at its center.

It occurred to me that researching such place names would be a good way to see if there was still a residual “shadow” of the old Northwest territory so I downloaded a sample list of company headquarters with the phrase “Northwest” or “Northwestern” in their names and plotted them on a map. Alas, this attempt failed to find anything significant (there was too much competition with the Pacific Northwest in name usage). However, I did look up some other regional terms with more positive results.

 

The geographic patterns for most of these terms are fairly distinct but there are also some areas of overlap. It was especially interesting to see regions that had local businesses in three or more categories. The old Northwest territory fits this mold with a combination of Midwest, Great Lakes, and Prairie.

What ‘The Office’ Gets Wrong About the Office

I start a new job next week and so I’ve been working on documenting all of my old tasks and projects in preparation for the transition. As I was going through old e-mails, I came across the introductory note my manager sent out to the department on my first day back in June 2004. Comparing it to the departure e-mail from my current manager, it’s amazing to see the changes in personnel over a seven-year span.

I prepared this chart using the distribution list from both e-mails, a drawing program, and a site that creates proportional Venn diagrams. Only eight people are listed twice, including me and a person who left the company and has since returned. Some of the people who are only listed once have more tenure than me; they just may have gone to or come from another department. Still, it represents an interesting fact about the modern office. Change is constant.

Thick as a [LEGO] Brick

A few weeks ago, Samuel Arbesman wrote an article in Wired touching on the mathematical properties inherent in LEGO structures. In it, he discussed the results of a 10-year-old study of natural and human-made networks that described how the number of distinct components in a network increased with the overall size of the network.

The study showed that the LEGO systems did indeed follow this rule. However, Arbesman noted that the relationship increased sublinearly, suggesting that LEGO systems were under some form of selection pressure (like the economics of production) that made it more expensive to grow the system and create new types of pieces. He was curious to see whether or not these findings would hold true with a more complete list of LEGO sets available today (n=389 in the 2002 study).

After using a webcrawler to pull the data for the available sets and their component pieces, I was presented with a list of over 6,800 individual toys or kits. Not all of these kits fit the criteria of the original study, which investigated sets that were designed to build something specific as opposed to generic collections of pieces.

Paring down this list turned out to be the most difficult part of this exercise. I ended up eliminating any set that had words like “accessories,” “supplemental,” or “universal building set” in the name. I also removed entire toy lines such as DUPLO, Clikits, and Primo/Baby, which didn’t seem to fit in the standard LEGO system. Basically, I tried to include anything with a brick, plate, or tile that had a picture of a single object on the box. I ended up with about 3,750 sets … or about ten times the number in the original study.
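The name-based part of that filtering can be sketched as follows. The excluded phrases and toy lines come from the description above, but the record layout (a set name plus a product-line label) and the sample rows are hypothetical, and the "single object on the box" judgment was manual and is not captured here.

```python
EXCLUDED_PHRASES = ("accessories", "supplemental", "universal building set")
EXCLUDED_LINES = {"DUPLO", "Clikits", "Primo/Baby"}

def keep_set(name: str, line: str) -> bool:
    """Return True if a set survives the name and toy-line filters."""
    if line in EXCLUDED_LINES:
        return False
    lowered = name.lower()
    return not any(phrase in lowered for phrase in EXCLUDED_PHRASES)

# Made-up rows in a hypothetical (name, line) layout.
catalog = [
    ("Fire Station", "Town"),
    ("Universal Building Set 400", "Basic"),
    ("Farm Animals", "DUPLO"),
    ("Train Accessories Pack", "Trains"),
]

# Only rows that pass both filters remain.
kept = [name for name, line in catalog if keep_set(name, line)]
```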

So, do the results hold up with the new data? At first glance, it appears they do. Both the log-log and semi-log plots described in the study are reproduced here with the larger counts. Note that a power-law relationship still appears to fit the data better than a logarithmic relationship.
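For anyone who wants to try this kind of comparison, here is a rough sketch of fitting both curve shapes with ordinary least squares. It is not the study's method or my exact workflow: the data are synthetic (generated from a known power law, not real LEGO counts), and comparing the two fits by squared error in the original units is my own simplification.

```python
import math

def linear_fit(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def fit_power_law(sizes, types):
    """Fit types = c * sizes**k via a straight line on log-log axes."""
    a, k = linear_fit([math.log(s) for s in sizes], [math.log(t) for t in types])
    return math.exp(a), k

def fit_logarithmic(sizes, types):
    """Fit types = a + b * ln(sizes) via a straight line on semi-log axes."""
    return linear_fit([math.log(s) for s in sizes], types)

def sse(predict, sizes, types):
    """Sum of squared errors of a prediction function, in original units."""
    return sum((predict(s) - t) ** 2 for s, t in zip(sizes, types))

# Synthetic data generated from types = 2 * sizes**0.7 (not real LEGO counts).
sizes = [10, 50, 100, 500, 1000, 5000]
types = [2 * s ** 0.7 for s in sizes]

c, k = fit_power_law(sizes, types)
a, b = fit_logarithmic(sizes, types)
power_error = sse(lambda s: c * s ** k, sizes, types)
log_error = sse(lambda s: a + b * math.log(s), sizes, types)
```

An exponent `k` below 1 is what "sublinear" means here: the number of piece types keeps growing with set size, but more slowly than the size itself.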



Once I had access to all of that cool LEGO data, of course, I couldn’t resist a few more visuals. The first thing I developed was an interactive chart that lets you navigate the size and complexity data to see specific kits. Check out the links for pictures and parts lists.

This display was interesting because the LEGO kits with the most pieces tended to be elaborate secret bases or fortresses while the LEGO kits with the most variety of pieces were cultural artifacts like the Taj Mahal or the Statue of Liberty. Ironically, the Death Star (which might be considered both a cultural icon and a fortress) fits neatly in the upper right corner.

The following charts look at the trend of unique pieces over time as well as the distribution of color over the distinct LEGO sets available (this includes all LEGO products, not just the specific “objects” used in the logarithmic plots above). Note both the increasing variety of the LEGO pieces and the move away from the traditional color palette. The mottled gray represents the “other” category.

It is interesting to note that the shift toward more complexity in both pieces and colors corresponds with the deal LEGO inked with Lucasfilm in 1999 that allowed the company to sell toys based on the “Star Wars” universe. These changes came at a time of turmoil for LEGO as it struggled to remain true to its roots while competing with a flood of specialty toys and video games. Licensing products from Lucasfilm was a big step for LEGO but one that seems to have paid some creative dividends … five of the top ten largest LEGO structures ever released commercially are spaceships from the “Star Wars” series.

This trend toward replicating such specific visions (LEGO has also licensed themes from Harry Potter, Toy Story, Pirates of the Caribbean, and others) explains some of the incredible variety of pieces now in circulation. Items from these new kits introduced many pieces used only once.

On the opposite end of the spectrum, the most commonly shared LEGO piece in the database is a black 1 x 2 plate (part number 3004). The other pieces in the top 10 are also very simple and very monochromatic. I found it interesting that all the colors in the top ten reflected the sequence of Berlin and Kay’s basic color terms (in which Stage I cultures have only the colors black (dark–cool) and white (light–warm) and Stage II adds Red).

One thing this database does not cover is the huge market for non-standard kits and free-form LEGO bricks. According to Chris Anderson’s Long Tail blog:

“… 90% of Lego’s products are not available in traditional retail. They’re only available in the catalogs and online … [o]verall, those non-retail parts of the business represent 10-15% of Lego’s annual $1.1 billion in sales.”

User-created structures represent an amazingly creative use of the standard set of parts available. Check out this football stadium or this minifig-scaled Saturn V rocket. Some of these models were created using the old LEGO Factory/Design by Me software but some are done on the fly. It would be interesting to see if some of the above findings apply to these custom structures.

For more stats and a company timeline, check out this site.

Updates: