Category Archives: Programming

Anatomy of an Analysis (Part 1)

A few weeks ago, BBC News produced a list of the 100 greatest American films based on input from film critics around the world.

Here are the top ten films presented in rank order:

  1. Citizen Kane (Orson Welles, 1941)
  2. The Godfather (Francis Ford Coppola, 1972)
  3. Vertigo (Alfred Hitchcock, 1958)
  4. 2001: A Space Odyssey (Stanley Kubrick, 1968)
  5. The Searchers (John Ford, 1956)
  6. Sunrise (FW Murnau, 1927)
  7. Singin’ in the Rain (Stanley Donen and Gene Kelly, 1952)
  8. Psycho (Alfred Hitchcock, 1960)
  9. Casablanca (Michael Curtiz, 1942)
  10. The Godfather Part II (Francis Ford Coppola, 1974)

There is really nothing too surprising here. Perennial favorite Citizen Kane tops the list, followed by The Godfather and Vertigo — two of the most famous films (by two of the most famous directors) ever produced. Perusing the full list, you might recognize a few other titles and maybe think about adding some of them to your Netflix queue. But that’s about it. Aside from a handful of ancillary stories, there was little additional commentary to draw you deeper into the story. Sensing an opportunity, I decided to use this list to demonstrate the steps involved in a quick and simple analysis of data found “in the wild.”

Here follows a demonstration of my 5-step program for data analysis:

Source

The BBC asked each critic to submit a list of the ten films they felt were the greatest in American cinema (“… not necessarily the most important, just the best.”). For the project, an “American film” was defined as any movie that received funding from a U.S. source. This definition included many films by foreign directors as well as films shot outside of the country. The highest-ranked film on each list received ten points, the next film down received nine points, and so on, with the tenth pick receiving one point. All the points were then tallied to produce the final list.
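
For the curious, the scoring scheme is easy to sketch in code. The ballots below are invented placeholders rather than actual critics’ lists, but the tallying logic is the same:

```typescript
// Minimal sketch of the BBC's scoring scheme: each critic ranks ten films,
// rank 1 earns 10 points, rank 2 earns 9 points, ..., rank 10 earns 1 point.
// The ballots here are placeholders, not actual critics' lists.

type Ballot = string[]; // films in rank order, best first (up to ten entries)

function tallyScores(ballots: Ballot[]): Map<string, number> {
  const scores = new Map<string, number>();
  for (const ballot of ballots) {
    ballot.forEach((film, index) => {
      const points = 10 - index; // rank 1 -> 10 points, rank 10 -> 1 point
      scores.set(film, (scores.get(film) ?? 0) + points);
    });
  }
  return scores;
}

// Two hypothetical (truncated) ballots:
const ballots: Ballot[] = [
  ["Citizen Kane", "Vertigo", "The Godfather"],
  ["The Godfather", "Citizen Kane", "Psycho"],
];

const ranked = [...tallyScores(ballots).entries()].sort((a, b) => b[1] - a[1]);
console.log(ranked); // [["Citizen Kane", 19], ["The Godfather", 18], ...]
```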

Processing

Even though the resulting “listicle” is fairly simple, it contains a lot of interesting information just waiting to be freed from its icy confines. I pulled the list into Excel and used some very basic string (text) manipulation to create four basic fields from each row of information:
[Image: the list entries split into four fields using string manipulation]

Additional manipulation of the “Year” field yields a useful grouping category:
[Image: the “Year” field grouped into decades]

With the creation of these five fields, I now have a flexible database instead of a rigid list.
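
If you would rather script it than write spreadsheet formulas, a rough equivalent of that string manipulation might look like the sketch below. The field names and the regular expression are mine, not the actual Excel formulas:

```typescript
// Rough equivalent of the Excel string manipulation: split each list entry
// into rank, title, director and year, then derive a decade grouping.

interface FilmRecord {
  rank: number;
  title: string;
  director: string;
  year: number;
  decade: string;
}

function parseEntry(rank: number, entry: string): FilmRecord | null {
  // Entries look like: "Citizen Kane (Orson Welles, 1941)"
  const match = entry.match(/^(.*)\((.*),\s*(\d{4})\)\s*$/);
  if (!match) return null;
  const year = parseInt(match[3], 10);
  return {
    rank,
    title: match[1].trim(),
    director: match[2].trim(),
    year,
    decade: `${Math.floor(year / 10) * 10}s`, // e.g. 1941 -> "1940s"
  };
}

console.log(parseEntry(1, "Citizen Kane (Orson Welles, 1941)"));
// { rank: 1, title: "Citizen Kane", director: "Orson Welles", year: 1941, decade: "1940s" }
```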

Organization

The final data set was “stored” as a table in a simple spreadsheet. Although I have many problems using Excel for data storage (more on that in a future post), it is a quick and easy way to organize small sets of data.

Transformation

Once the data was in the format I wanted, I created a pivot table that allowed me to manipulate information in different ways. I was particularly interested in answering questions like “Who are the top directors?”, “When were most of these films made?”, and “Was there ever a ‘Golden Age’ of modern cinema?” Most of these questions can be answered through simple grouping and summarization.
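
In code, that pivot table boils down to a simple “group and count” pass. Here is a sketch that assumes an array of the FilmRecord objects built in the parsing example above:

```typescript
// A pivot-table-style rollup in code: group the parsed film records by a key
// (director, year, or decade) and count them. FilmRecord is the interface
// defined in the parsing sketch above.

function countBy<T>(items: T[], key: (item: T) => string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const item of items) {
    const k = key(item);
    counts.set(k, (counts.get(k) ?? 0) + 1);
  }
  return counts;
}

// Usage (films is an array of FilmRecord objects):
//   const byDirector = [...countBy(films, (f) => f.director).entries()]
//     .sort((a, b) => b[1] - a[1]); // e.g. [["Stanley Kubrick", 5], ...]
//   const byDecade = countBy(films, (f) => f.decade);
```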

Serve

After all that work, it’s time to pull together the results and display them in some way. For this exercise, that means a few simple tables and charts:

TOP 10 DIRECTORS IN AMERICAN FILM

Director | # of Films in the Top 100
Stanley Kubrick | 5
Steven Spielberg | 5
Alfred Hitchcock | 5
Billy Wilder | 5
Francis Ford Coppola | 4
Howard Hawks | 4
Martin Scorsese | 4
John Ford | 3
Orson Welles | 3
Charlie Chaplin | 3

TOP YEARS IN AMERICAN FILM

Year | # of Films in the Top 100
1975 | 5
1980 | 4
1974 | 4
1959 | 4
1939 | 3
1941 | 3
1977 | 3
1946 | 3
1994 | 3

TOP DECADES IN AMERICAN FILM
[Chart: number of Top 100 films by decade]

These simple presentation tools start to tell some interesting stories and — like all good analysis tools — start to hint at additional avenues of exploration. For example, while two of the directors with five films in the Top 100 (Kubrick and Hitchcock) also placed films in the Top 10, the other two (Spielberg and Wilder) did not … why? The year with the most films on the list was 1975 … what were the films? The 1970s account for over 20% of the films on the list … what was going on in the culture that led to this flowering of expression?

It would have been really great if the BBC article had included some sort of interactive tool that allowed readers to explore the database themselves. I will see what I can do to tackle this in an upcoming post.

Ideas Illustrated LLC Celebrates 10 Years

I was filling out the annual report forms for Ideas Illustrated LLC a few weeks ago and noticed that my original filing date was May 11, 2004… making today my 10th anniversary! It’s hard to believe that a full decade has passed since my wife and I sat around brainstorming ideas for a company. It’s been a fun ride so far, with several great side projects, a well-regarded blog, and a lot of new challenges. It hasn’t made me a millionaire but it has put some extra cash in my pocket and probably saved my sanity on more than one occasion. Here’s to ten more years!

[Image: Ideas Illustrated 10th anniversary logo]

Trends in NFL Football Scores (Part 1)

One of the goals I set for myself this summer was to learn a bit about D3, a visualization toolkit that can be used to manipulate and display data on the web. Considering that the trees are bare and we’ve already had our first frost here in Wisconsin, you can safely assume that I am behind schedule. Nevertheless, I feel that I’ve finally reached a point where I have something to publish, so here goes.

First of all, a little background. D3 is a JavaScript library that allows you to bind data to any of the elements (text, lines and shapes) you might normally find on a web page. These objects can be styled using CSS and animated using simple dynamic functions. These features make D3 a perfect tool for creating interactive charts and graphs without having to depend on third-party programs like Google Charts, Many Eyes or Tableau.
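
To make the data-binding idea concrete, here is a minimal sketch that joins an array of yearly scores to SVG circles. It assumes a current release of D3 and uses made-up numbers; it is not the exact code behind the charts in this post:

```typescript
// Minimal sketch of D3's core idea: bind an array of data to page elements.
// Uses a current d3 release and made-up numbers; not the actual chart code.
import * as d3 from "d3";

interface SeasonScore {
  year: number;
  avgScore: number; // average combined score of winner and loser
}

const seasons: SeasonScore[] = [
  { year: 1920, avgScore: 21.5 }, // placeholder values, not the real data
  { year: 1921, avgScore: 20.1 },
  { year: 1922, avgScore: 18.9 },
];

// Scales map data values to pixel positions (y is inverted for SVG).
const x = d3.scaleLinear().domain([1920, 2011]).range([40, 760]);
const y = d3.scaleLinear().domain([0, 60]).range([360, 20]);

d3.select("body")
  .append("svg")
  .attr("width", 800)
  .attr("height", 400)
  .selectAll("circle")
  .data(seasons) // bind one datum to each (future) circle
  .enter()
  .append("circle") // create an element for every datum
  .attr("cx", (d) => x(d.year))
  .attr("cy", (d) => y(d.avgScore))
  .attr("r", 3)
  .style("fill", "steelblue"); // styling can also come from a CSS class
```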

I wanted to start out with something simple, so I elected to go with a basic line chart using data I pulled from Pro-Football-Reference.com. This site contains a ton of great information and statistics from the past 90+ years of the National Football League but — for now — I just looked at the final scores of all the games played from 1920 to 2011. My first D3-powered chart is below. It shows the average combined scores of winning and losing teams for each year of the NFL’s existence.

Although this chart looks pretty simple, every element — including titles, subtitles, axes, labels, grids and data lines — has been created manually using D3 code. The payoff is pretty nice: all of the elements can be reused and you have tremendous control over what is shown onscreen. To demonstrate some of these capabilities, I’ve added interactive overlays that show a few of the major eras in NFL football (derived from the work of David Neft and this discussion thread). If you move your mouse over the graph, you will see these different eras highlighted (a rough sketch of the overlay code follows the era descriptions below):

Early NFL (1920-1933) – The formation of the American Professional Football Association (APFA) in 1920 marked the official start of what was to become the National Football League. This era was marked by rapid formation (and dissolution) of small town franchises, vast differences in team capabilities and a focus on a relatively low-scoring running game. At this time, the pass was considered more of an emergency option than a reliable standard. The rapid growth in popularity of the NFL during this era culminated with the introduction of a championship game in 1932.

Introduction of the Forward Pass (1933-1945) – The NFL discontinued the use of collegiate football rules in 1933 and began to develop its own set of rules designed around a faster-paced, higher-scoring style of play. These innovations included the legalization of the forward pass from anywhere behind the line of scrimmage — a change that is often called the “Bronko Nagurski Rule” after Nagurski’s controversial touchdown in the 1932 NFL Playoff Game.

Post-War Era (1945-1959) – The end of WWII saw the expansion of the NFL beyond its East Coast and Midwestern roots with the move of the Cleveland Rams to Los Angeles — the first big-league sports franchise on the West Coast. This period also saw the end of racial segregation (enacted in the 30s) and the start of nationally televised games.

Introduction of the AFL (1959-1966) – Professional football’s surge in popularity led to the formation of a rival organization — the American Football League — in 1960. The growth of the flashy AFL was balanced by a more conservative style of play in the NFL. This style was epitomized by coach Vince Lombardi and the Green Bay Packers, who would win five championships in the 1960s. In 1966, the two leagues agreed to merge as of the 1970 season.

Dead Ball Era (1966-1977) – Driven in part by stringent restrictions on the offensive line, this period is marked by low scores and tough defensive play. Teams that thrived in this environment include some of the most famous defenses in modern NFL history: Pittsburgh’s Steel Curtain, Dallas’ Doomsday Defense, Minnesota’s Purple People Eaters and the Rams’ Fearsome Foursome.

Live Ball Era (1978-present) – Frustrated by the decreasing ability of offenses to score points in the 70s, the NFL began to add rules and make other changes to the structure of the game in an attempt to boost scoring. The most famous of these initiatives was the so-called “Mel Blount Rule” (introduced in 1978), which severely restricted the defense’s ability to interfere with passing routes. With the subsequent introduction of the West Coast Offense in 1979 — an offense based on precise, short passes — this period became marked by a major focus on the passing game.
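
For what it’s worth, here is a rough sketch of how an overlay like this can be wired up in D3. The era boundaries mirror the periods described above, and the chart and scale are assumed to be the ones from the previous sketch:

```typescript
// Sketch of the era-overlay idea: one invisible rectangle per era, revealed
// on mouseover. The scale is re-created here so the snippet stays
// self-contained; the svg is assumed to be the chart drawn earlier.
import * as d3 from "d3";

interface Era {
  name: string;
  start: number;
  end: number;
}

const eras: Era[] = [
  { name: "Early NFL", start: 1920, end: 1933 },
  { name: "Introduction of the Forward Pass", start: 1933, end: 1945 },
  { name: "Post-War Era", start: 1945, end: 1959 },
  { name: "Introduction of the AFL", start: 1959, end: 1966 },
  { name: "Dead Ball Era", start: 1966, end: 1977 },
  { name: "Live Ball Era", start: 1978, end: 2011 },
];

const x = d3.scaleLinear().domain([1920, 2011]).range([40, 760]);

d3.select("svg") // the chart created in the previous sketch
  .selectAll("rect.era")
  .data(eras)
  .enter()
  .append("rect")
  .attr("class", "era")
  .attr("x", (d) => x(d.start))
  .attr("width", (d) => x(d.end) - x(d.start))
  .attr("y", 20)
  .attr("height", 340)
  .style("fill", "orange")
  .style("opacity", 0) // invisible until the mouse passes over it
  .on("mouseover", (event) => d3.select(event.currentTarget).style("opacity", 0.2))
  .on("mouseout", (event) => d3.select(event.currentTarget).style("opacity", 0));
```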

Having created this first chart, I decided to build a second chart based on the ratio of average winning scores to average losing scores to see if there were any patterns.

The chart above shows how — after a period of incredibly lopsided victories — the average scoring differential settled into a very steady pattern by the late 1940s and stayed at that level (roughly 2:1) for the next 30 years. Despite many changes in rules, coaching techniques, technology and other factors, only the pass interference rules of the late 1970s seemed to have any significant effect on this ratio, shifting it to just under 1.8:1 for the next 30 years.

While I had the data available, I also decided to look at the differences in average scores between home teams and away teams. The chart below plots this data along with the same overlay I used in the first chart.

A look at the ratio of average home team scores to average away team scores follows:

What’s fascinating about this chart is how quickly a form of parity was achieved among all the NFL teams. By the mid-30s, a measurable home field advantage can be seen at roughly 15%, a rate that has remained essentially constant for over 70 years. Factors behind this boost could include the psychological support of fans, familiar weather conditions, unique features of local facilities, lack of travel fatigue, referee bias and/or increased levels of motivation in home town players.

Thanks to Charles Martin Reid for his solution to getting D3 and WordPress to play nice.

There’s a Person in There

My wife works as a speech and language pathologist for a local school district and her caseload is a mix of kids from ages 3 to 21. One of her most difficult cases involves a student with severe cerebral palsy who transferred into the district when he was 12 years old. This student cannot speak or use his body to convey information and currently expresses himself primarily through eye contact and facial expressions. Because of this extremely limited range of abilities, his cognitive functioning is unknown.

The only real communication options involve interpretation of the student’s eye movements. Professionals can do this using a contraption called an eye gaze board, which is a simple frame to which you attach pictures or symbols. The professional sits face-to-face with the student, holds up the frame, and then prompts the student with a question. By observing where the student looks, the professional can make assumptions about their intended responses.

Needless to say, this approach has some limitations, particularly in this case. The student’s eye movements are hard for even professionals to interpret and the student himself tires very quickly. After careful consideration, my wife elected to see if the student could use an eye-tracking device in combination with a communication board or speech generating device (SGD) — a specialized computer that allows the user to build messages and relay them to others through a synthesized voice. (Dr. Stephen Hawking is a famous user of such a device.)

Users can access these devices directly using a keyboard or touch screen or they can manipulate them indirectly with a joystick, adapted mouse, optical pointer, eye tracking device, or other type of controller. The specific access method depends entirely on the abilities of the user and, in this case, there are not a lot of options. The student is quadriplegic and does not even have enough control over his head and neck movements to use switch access scanning, in which an indicator such as a cursor steps through selections automatically and the user hits a switch when the indicator lands on the desired choice. Blinking is also out for similar reasons.

A comparison of the two options shows some obvious advantages for the eye tracking option. Unfortunately, these devices are not cheap. While an eye gaze board can be assembled from five bucks’ worth of spare parts, a communication board and eye-tracking device cost about $8,000 apiece. No school district is going to spring for such a purchase these days so it became necessary for my wife to apply for a loaner and see if she could build a case for Medicaid.

Comparative Evaluation (Eye Gaze Board & Eye Tracking Device)

Criterion | Eye Gaze Board | Eye Tracking Device
Ease of Set-Up | Easy | Difficult
Ease of Listener Comprehension | Difficult | Easy
Y/N Response Accuracy | 20-30% | 80%
Number of Communication Functions | 4 | 14
Size of Picture Field | 4 pictures | 12 pictures
Length of Time Before Fatigue | 10 minutes | 30-40 minutes
Maximum Length of Utterance | 1 | 4+
Able to Fine-Tune Dwell Times | No | Yes
Able to Independently Introduce a Topic | No | Yes
Able to Communicate with Multiple Listeners | No | Yes
Able to Call for Attention | No | Yes
Able to Communicate with Non-Professionals | No | Yes
Able to Repair Communication Breakdown | No | Yes

1st Trial

The loaner — a Dynavox Vmax/EyeMax system — arrived in the last few weeks of the 2011 school year and came with some standard navigation screens or “boards” that are based on vocabulary and language ability levels. The user categories include — in order of ability — emergent communicators, context-dependent communicators, and independent communicators.

The primary choice for this case was between context-dependent, which means that the student’s ability to communicate depends on the environment, topic, or communication partner, and independent, which means that they are able to combine single words, spelling, and phrases together to create novel messages about a variety of subjects.

[Images: Examples of the Context-Dependent “Child 12” Navigation Page (left) and Scene (right)]

[Images: Examples of the Independent Gateway “Child 12” Set-Up (left) and “Child 40” Set-Up (right)]

These navigation boards make extensive use of picture communication symbols (PCS) and the Fitzgerald color coding system for language development. PCS are simply standard graphics whose meanings are easily understood, while the Fitzgerald “key” system assigns colors to specific grammatical forms. The pseudo-3D appearance of the buttons looks a little dated to my eye but the perceived affordance may be necessary for some users. The program itself is highly customizable.

To create a message using the different boards, a user would navigate through the system and click on each component in turn until they were finished. For the purposes of measuring message complexity, each of these steps counted as one “navigational unit.”

A simple request for a sandwich might look like this in a context-driven environment (for an utterance of four navigational units):


The same request in a word-based environment would look like this (for an utterance of three navigational units):

My wife’s student’s communication level is context-dependent. However, the navigation boards available for context-driven communication were too complex for him to use and many of the topics simply weren’t relevant. (He would never use either of the above examples because he doesn’t eat solid food — all nutrients are provided through a gastro-intestinal tube.) To get around some of these issues, she programmed a customized board based on his particular abilities and interests.

Some of these modifications were fairly extensive. Since her student had no understanding of grammatical structure at this time, she simplified the color scheme so it only used three colors: orange for the “back” button, blue for any folder that could be opened, and gray for any item at the bottom of a decision tree. She also tightened up the button groupings to reduce difficult eye movements and eliminated any buttons that would appear “underneath” the back button to reduce navigation errors. Finally, she set the dwell time between 0.85 and 0.90 seconds — the effective “window” for reading the student’s gaze accurately.

[Images: Customized Communication Board]

The student was able to use the system for about 2 1/2 weeks in late Spring 2011 and for one week in Fall 2011. During the trial period, he used the twelve-button screen for several language functions, including basic greetings, requests, yes/no responses, exclamations, expressions of physical state, and even a few jokes (knock-knock jokes that my wife programmed into the computer). The range of communication partners included school faculty and several family members.

For casual observers, the student’s performance using the device was revelatory. One teacher who overheard the student working on a craft-related activity stated simply, “Wow, there’s a person in there.”

Although it might seem obvious that such a tool would be beneficial for this particular student, the services were not deemed “medically necessary” and the initial request for Medicaid was denied. The evaluator felt that there just wasn’t enough evidence showing independent use of the system to create novel utterances. (Attempts to include some peer-appropriate language may have backfired when the evaluator dinged the student for overly frequent use of the phrase “smell ya later.”)

Another, longer trial was suggested.

2nd Trial

The next loaner arrived in April 2012 and my wife was determined to gather more quantitative data and provide as much documentation of the second trial as she could. Each of the student’s statements during the trial period was recorded and evaluated for complexity (number of navigational units or levels), conversational turns (the alternations or volleys between two speakers), and function. Functions included descriptions (items, past events), requests (actions, information, objects), responses to requests, social devices (spontaneous calls, exclamations, greetings) and statements (emotions, future events, personal information, opinions). After four weeks, there were 265 individual utterances available for analysis.

A few initial findings:

  • The student’s accuracy of responses to yes/no questions increased to 80% using the eye tracking device in conjunction with the SGD (compared to 20-30% on the eye gaze board).
  • The student’s ability to look at an item on command improved to 85%.
  • The student was able to comprehend all of the noun and verb phrases programmed into the device.
  • The student demonstrated comprehension of the following: categories, colors, shapes, sizes, action words, possessives, time words, words denoting quantity, pronouns and wh-questions.
  • The student spontaneously accessed the machine to call attention and participate in conversations with a variety of adults and peers.
  • The student combined multiple symbols to create a message and often used one symbol in novel ways. For example, he would use “bye” to indicate that he wanted to stop an activity.
  • The student demonstrated the ability to repair conversational breakdowns. After an unintended response, he would often “click” a word multiple times to emphasize the response he had actually intended.

During the trial period, the student gradually shifted from single-level utterances to more complex navigational structures. By the second half of the trial, 61% of his utterances used a combination of symbols and the average length of utterance increased from about 1.6 navigational units during the first two weeks of the trial to over 1.8 navigational units in the second two weeks. A basic MS Excel t-test performed on this metric suggests that this change was significant.

[Chart: Distribution of Utterances by Navigational Units (1 vs. >1)]

[Chart: Distribution of Utterances by Navigational Units]

The mean score for Half 1 (M = 1.605, SD = 0.727, N = 119) was significantly smaller than the mean score for Half 2 (M = 1.836, SD = 0.822, N = 146) using the two-sample t-test for unequal variances, t(261) = -2.42, p = 0.016 (two-tailed). This implies that the student has the attention, memory, and problem-solving skills to use an SGD to achieve his functional communication goals.

t-Test: Two-Sample Assuming Unequal Variances

Statistic | Half 1 | Half 2
Mean | 1.605 | 1.836
Variance | 0.529 | 0.676
Observations | 119 | 146
Hypothesized Mean Difference | 0
df | 261
t Stat | -2.42
P(T<=t) one-tail | 0.008
t Critical one-tail | 1.651
P(T<=t) two-tail | 0.016
t Critical two-tail | 1.969
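
For anyone who wants to check the arithmetic outside of Excel, the sketch below reproduces the t statistic and the Welch-Satterthwaite degrees of freedom from the summary numbers above (computing the p-value itself still requires a t-distribution table or a stats library):

```typescript
// Welch's two-sample t-test (unequal variances) computed from summary stats.
// The inputs are the means, variances and counts from the table above; the
// result reproduces Excel's t statistic (-2.42) and degrees of freedom (~261).

interface SampleSummary {
  mean: number;
  variance: number;
  n: number;
}

function welchTTest(a: SampleSummary, b: SampleSummary) {
  const sqErrA = a.variance / a.n; // squared standard error of each mean
  const sqErrB = b.variance / b.n;
  const t = (a.mean - b.mean) / Math.sqrt(sqErrA + sqErrB);
  // Welch-Satterthwaite approximation for the degrees of freedom
  const df =
    (sqErrA + sqErrB) ** 2 /
    (sqErrA ** 2 / (a.n - 1) + sqErrB ** 2 / (b.n - 1));
  return { t, df };
}

const half1: SampleSummary = { mean: 1.605, variance: 0.529, n: 119 };
const half2: SampleSummary = { mean: 1.836, variance: 0.676, n: 146 };

console.log(welchTTest(half1, half2)); // roughly { t: -2.42, df: 261 }
```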

Interestingly, many of the student’s more complex utterances came in conversations with peers — pre-teens with no training in speech and language communication. The student also increased the number of conversational turns per topic over time and, as with conversational complexity, his performance was better with his peers: he had longer conversational “volleys” and used longer strings of symbols than he did in conversations with adults.

Navigational Units Comparison by Listener

Listener | 1 unit | 2 units | 3 units | 4 units
Peer | 45.6% | 40.5% | 8.9% | 5.1%
Professional | 46.2% | 36.3% | 16.5% | 1.1%

Conversational Turns Comparison Over Time

Half | 1 turn | 2 turns | 3 turns | 4 turns | 5 turns | 6 turns
1 | 59.5% | 25.0% | 9.5% | 4.3% | 1.7% | 0.0%
2 | 52.8% | 27.1% | 10.4% | 5.6% | 3.5% | 0.7%

Conversational Turns Comparison by Listener

Listener | 1 turn | 2 turns | 3 turns | 4 turns | 5 turns | 6 turns
Peer | 53.9% | 23.7% | 11.8% | 6.6% | 3.9% | 0.0%
Professional | 55.6% | 27.8% | 9.4% | 4.4% | 2.2% | 0.6%

While there is no doubt that this technology would prove incredibly beneficial in this situation, the strict rules surrounding Medicaid requests make the outcome difficult to predict. By carefully documenting the results of this second trial (and including some awesome tables and charts), my wife hopes to tip the scales in her student’s favor. The report was mailed yesterday, so cross your fingers. As my wife’s student might say (through his technology-assisted communication device): “Let’s get this party started!”

Update:

  • June 22, 2012 – The request was approved. There is some hard work ahead but this is a big hurdle to clear. Congratulations and good luck to everyone involved!

Wisconsin Voters Banished to NULL Island

The top headline in my local paper this morning was “Glitch puts some Wisconsin voters in Africa” … an interesting thing to ponder over a bowl of Quaker Oatmeal Squares. I suppose this problem merits at least some attention given the heated political climate surrounding the state’s voter redistricting process. But headline news? Above the fold? Sounds like a slow news day to me.

Online, of course, the debate has already devolved into the standard round of mudslinging and name-calling so good luck trying to find out what’s going on from that crowd. The reporters themselves focused on the political fallout of the issue rather than an explanation so no help there either. I guess it’s up to the humble folks at Ideas Illustrated to offer up some insight!

The first clue to the problem can be found in the article’s pullout quote, which describes the voters’ location as the “coast of Africa” rather than a specific country in Africa. The second clue can be found deep within the article, where it is mentioned that clerks have recently changed the way voters are entered into the voter registration database:

“… voters are [now] being entered into different districts by the physical location of their address in computerized maps. Previously, they were entered into different districts in the state voter database according to where their address fell in certain address ranges.”

These two hints point to a very common problem associated with geocoding, which is the process of converting a postal address into a set of map coordinates. Let’s backtrack. An online mapping tool like Google Maps uses specific geographic coordinates (latitude and longitude) to place a location on a map. However, because the coordinates of most individual addresses are not actually stored in a database anywhere, the tool needs to interpolate them from a vector database of the road network (i.e. a mathematically represented set of lines).

For example, if you look up the address for Trump Tower, you find that it is located at 725 Fifth Avenue in Midtown Manhattan. When you enter this address into Google Maps, the tool finds 5th Avenue on the underlying road grid and then uses an algorithm to determine that the “725” address falls somewhere between 56th and 57th Streets. It will also determine which side of the street the address is on, based on stored knowledge of the “odd” and “even” numbering pattern. In other words, it’s guessing.
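
A toy version of that interpolation step might look like the sketch below. The street segment, its address range, and its coordinates are made up for illustration; real geocoders work against TIGER/Line-style road files with far more bookkeeping:

```typescript
// Toy address interpolation: place house number 725 along a street segment
// whose address range and end coordinates are known. The segment below is a
// made-up example, not real TIGER/Line data.

interface StreetSegment {
  fromNumber: number; // address at the start of the segment
  toNumber: number;   // address at the end of the segment
  start: { lat: number; lon: number };
  end: { lat: number; lon: number };
}

function interpolateAddress(houseNumber: number, seg: StreetSegment) {
  // How far along the segment does this number fall (0 = start, 1 = end)?
  const fraction =
    (houseNumber - seg.fromNumber) / (seg.toNumber - seg.fromNumber);
  return {
    lat: seg.start.lat + fraction * (seg.end.lat - seg.start.lat),
    lon: seg.start.lon + fraction * (seg.end.lon - seg.start.lon),
  };
}

// Hypothetical block of Fifth Avenue between 56th and 57th Streets:
const block: StreetSegment = {
  fromNumber: 701,
  toNumber: 745,
  start: { lat: 40.7622, lon: -73.9725 },
  end: { lat: 40.7628, lon: -73.9721 },
};

console.log(interpolateAddress(725, block)); // a guess, roughly mid-block
```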

[Image: Google Map detail of the area around Trump Tower]

[Image: TIGER/Line® Shapefile detail of the same area]

These guesstimates work pretty well in dense urban environments where there are a lot of cross streets to serve as reference points. In rural areas, the curvilinear streets and widely-spaced buildings make things a little more difficult. When the situation gets really muddled, some mapping tools essentially “punt” and enter a default set of coordinates. In the case of the Wisconsin voter addresses, these default coordinates are 0.00 degrees latitude and 0.00 degrees longitude. Where is this exactly? It is the intersection of the Prime Meridian and the Equator … which occurs just off the coast of Africa.

Geographers have actually given this spot a rather fanciful name: NULL Island (it is not, in fact, a real island). It even has its own web site and unofficial flag (below right).

So there are no nefarious schemes behind this situation … just normal, everyday data problems. The state clerks need to tell their IT guys to flag the errant voter addresses and then they can assign them to the appropriate districts by hand. Problem solved. However, they should be aware that interpolation is an imperfect process and, in addition to assigning blocks of voters to NULL island, the geocoding process may also assign voters to the wrong districts. This could be particularly true for people who live close to a district boundary. It might actually make sense to keep the old method around for backup.
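
The flagging step itself is about as simple as it sounds. Here is a sketch using a hypothetical voter record shape:

```typescript
// Sketch of the clean-up step: any record geocoded to (0, 0) landed on
// "NULL Island" and needs to be re-assigned by hand. The VoterRecord shape
// is hypothetical; the real database surely has many more fields.

interface VoterRecord {
  id: string;
  address: string;
  lat: number;
  lon: number;
}

function flagNullIsland(voters: VoterRecord[]): VoterRecord[] {
  return voters.filter((v) => v.lat === 0 && v.lon === 0);
}

// Hand these back to the clerks for manual district assignment:
// const needsReview = flagNullIsland(allVoters);
```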

I Love a Good Hack

I can be a bit of a procrastinator sometimes and the rapid pace of software development can occasionally leave me with a lot of old and/or corrupt files lying around. Normally, that isn’t such a big deal but every now and then there’s a pretty important file I’d like to access … like a logo or a major document. Trying to jumpstart a web site is hard enough without having to redo a bunch of old work. That was my dilemma when I tried to open some old Illustrator files with a new version of the Adobe application. The old versions were just not compatible with the new program. Bummer.

Fortunately, the human race is a pretty creative bunch and the amazing technicolor interweb can help me find all sorts of tips and tricks for resolving these kinds of issues. The Illustrator version problem was cleared up by a simple text edit in the header lines of the PostScript file. Check out the whole solution here.

Simple and brilliant, my friend!

What I really like about this hack is the author’s use of the fact that many file formats are just text at heart … and some of them (like PostScript) are actually legible.
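
In the same spirit, the whole fix boils down to a find-and-replace on the file’s header. The sketch below captures the idea, but the exact strings to swap depend on the Illustrator versions involved, so check the linked post for the real ones:

```typescript
// Spirit-of-the-hack sketch (Node.js): older .ai files are PostScript text,
// so the fix is a find-and-replace on the version string in the header.
// The file names and the strings to swap are placeholders; the linked post
// has the real header lines to change.
import { readFileSync, writeFileSync } from "fs";

const inputPath = "old-logo.ai";        // hypothetical source file
const outputPath = "old-logo-fixed.ai"; // patched copy, leaving the original alone

const original = readFileSync(inputPath, "latin1");

// Swap the version declared in the DSC header comment (placeholder pattern).
const patched = original.replace(
  /%%Creator: Adobe Illustrator\(R\) [\d.]+/,
  "%%Creator: Adobe Illustrator(R) 11.0"
);

writeFileSync(outputPath, patched, "latin1");
console.log("Wrote patched copy to " + outputPath);
```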