I don't know about you all, but I've been enjoying the phoenix-like rebirth of our regular feature Graph of the Day. Much credit is owed to Justin Bopp and Walter Fulbright, who clearly know their layers from their masks. So I thought this might be a good time to talk about the elements of a good visualization.
Table of Contents
Data
Presentation
Finished Product
Discussion Question of the Day
The single most important function of any visualization is to convey data. Different bits of data are best visualized in different ways.
The preeminent authority on the presentation of data in visual form is Edward Tufte, and his most acclaimed work is The Visual Display of Quantitative Information. In it, Tufte gives examples of what does and does not make for an effective visualization.
One of Tufte's most essential points is that the manner of visualization is dictated by the data, not the other way around. Just because Microsoft Excel makes it very easy to turn just about any data set into any type of chart doesn't mean it's a good idea. For example, when the sum of the data is important, a pie chart often fails, because it does not give any visual indication of the magnitude of the sum. It is a representation of relative size, not absolute magnitude.
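To make the pie chart point concrete, here is a minimal sketch in Python. The function name and the two small data sets are made up for illustration. It computes the slice sizes a pie chart would draw and shows that two data sets with very different totals produce exactly the same pie.

```python
def pie_shares(values):
    """Return the fraction of the pie each value would occupy."""
    total = sum(values)
    if total <= 0:
        raise ValueError("a pie chart needs a positive total")
    return [v / total for v in values]


# Two hypothetical data sets (illustrative numbers only).
small = [10, 20, 70]        # sums to 100
large = [1000, 2000, 7000]  # sums to 10,000

# The slices are identical, so the two pies look the same...
same_picture = pie_shares(small) == pie_shares(large)

# ...even though the totals differ by a factor of 100.
totals = (sum(small), sum(large))
```

If the total matters to your argument, a chart with an absolute scale (a bar chart, say) carries that information; the pie throws it away.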
A related point is that good visualizations are dense. That doesn't mean they are hard to unravel, but rather that they contain a great deal of information per square inch (or pixel). Tufte's classic example is this visualization, created by Charles Joseph Minard in 1869 to depict Napoleon's Russian campaign of 1812. (For a sarcastic cautionary tale, see Tommy Rancel's incredible pie chart here.)
Now, it might take you a minute to untangle that, especially if you don't speak French (zut alors!). But what you're seeing are several visualizations at once: (1) a map of Napoleon's progress toward, and retreat from, Moscow (note the rivers), (2) a representation of the size of his armies (and their offshoots, with width indicating size), and most incredibly, (3) a representation of the temperature at each stage of the march (see the line chart at the bottom, with corresponding indicators to the chart above--higher is colder).
This is what we would call a dense visualization. Did I mention he did it in 1869? Think about that the next time you create a chart in Excel...
Presentation is an important part of any visualization, and the better a chart looks, the easier it is to spend lots of time deciphering it. I can claim no expertise, but there are a few basic guidelines for clean presentation of visual information.
First, colors should be easy to distinguish (even for those who suffer from color-blindness). But more than that, it is important to remember that color is a potential axis upon which information can be conveyed. Color coding is a good way to visualize one more variable, thus increasing the density of the information presented.
Second, a good chart is clearly labeled. Notice the plethora of labels included on Minard's chart above. There are dozens of labels, and they are all as close, spatially speaking, to the item they label as possible. The more complex a visualization gets, the more difficult it can be to follow lines connecting labels to data points or series. One of the things I appreciate about Justin and Walter's charts is the way they superimpose images of players onto the bars representing their performance (for an example, see here). It's an easy way to create a label that is tied very closely to the data.
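As a rough sketch of what "color as an axis" means in practice, the Python snippet below maps a numeric value onto a light-to-dark color ramp. Everything here is an assumption for illustration: the function name, the two endpoint colors, and the choice of a single-hue ramp, which varies mostly in lightness and so tends to stay readable for color-blind viewers.

```python
def value_to_hex(value, lo, hi, light=(222, 235, 247), dark=(8, 48, 107)):
    """Map a value in [lo, hi] to a hex color between `light` and `dark`.

    The endpoint colors are illustrative defaults, not a standard palette.
    """
    if hi == lo:
        raise ValueError("lo and hi must differ")
    # Position of the value along the ramp, clamped to the range [0, 1].
    t = (value - lo) / (hi - lo)
    t = min(1.0, max(0.0, t))
    # Linear interpolation of each RGB channel.
    rgb = [round(a + t * (b - a)) for a, b in zip(light, dark)]
    return "#{:02x}{:02x}{:02x}".format(*rgb)


# A third variable (say, a rate between 0 and 1) becomes the fill color
# of each point or bar, alongside whatever x and y already encode.
lowest = value_to_hex(0.0, 0.0, 1.0)   # lightest color on the ramp
highest = value_to_hex(1.0, 0.0, 1.0)  # darkest color on the ramp
```

The chart keeps its x and y axes and gains a third dimension without taking up any more space.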
Part of the inspiration for me venturing down this road was the work of Dave Allen, who writes for Fangraphs and Baseball Analysts. He's a programming-savvy guy, and he is probably best known for using the language R to create compelling heat maps based on PITCHf/x data. You can see a characteristic example of his work here. In the vast world of PITCHf/x graphs, Allen's are among the easiest to decipher, even for the lay fan. For that reason alone, his work is praiseworthy.
But I was especially impressed with this chart, taken from this article. Allow me to reproduce:
Allen explains:
Most batters have more power on pulled balls and pull more inside pitches. So is Pena's outside power from opposite field power or from an ability to pull outside pitches[?] To examine this I took inspiration from Max's work looking at the relationship between the horizontal location of a pitch and the horizontal angle of the resulting ball in play. In this case I just looked at Pena's HRs. Remember that -45 is the third base line and 45 is the first base line.
Here's what's great about this visualization: it's dense, it's high-contrast, and it conveys information in a way that could not be done without a graph. This last part is particularly important. There would be no way to understand the central point (that Carlos Pena pulls pitches on the outside half of the plate to right field) without this kind of graph. A table couldn't do it; a pie chart couldn't; a bar graph couldn't; an x-y plot couldn't.
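For anyone who wants to try something similar, the angle convention Allen describes (-45 for the third base line, 45 for the first base line) can be sketched in a few lines of Python. This is not Allen's code, which was written in R and is not shown in his article. The coordinate system (home plate at the origin, x toward the first base side, y toward center field) and the sample landing spots are assumptions made up for illustration.

```python
import math


def spray_angle(x_ft, y_ft):
    """Horizontal angle of a batted ball, in degrees.

    Assumed coordinates: home plate at (0, 0), +y toward center field,
    +x toward the first base side. Dead center is 0, the third base
    line is -45 and the first base line is 45.
    """
    return math.degrees(math.atan2(x_ft, y_ft))


# Hypothetical landing spots in feet (not real PITCHf/x or hit data).
landing_spots = {
    "down the right field line": (250.0, 250.0),
    "straight to center": (0.0, 400.0),
    "down the left field line": (-250.0, 250.0),
}

angles = {name: spray_angle(x, y) for name, (x, y) in landing_spots.items()}

# For a hitter who pulls the ball to right field, as the article
# describes Pena doing, the pulled balls are the positive angles.
pulled = [name for name, angle in angles.items() if angle > 0]
```

Pair each angle with the horizontal location of the pitch that produced it and you have the two series behind a chart like Allen's.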
The upshot is to reinforce the points made above, which Allen clearly understands. It also goes to show that knowledge of programming can be extremely useful in visualizing.
Discussion Question of the Day
When I showed this graph to a friend, he said he wished such a chart existed for every player. And I agree.
But it led me to another question. I have been thinking about how one might visualize AVG, OBP and SLG on a single chart to display the shape of a player's performance.
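As a starting point, here is a minimal Python sketch that computes the three series from ordinary counting stats, using the standard definitions of each. The function name and the sample stat line are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def slash_line(ab, h, doubles, triples, hr, bb, hbp, sf):
    """Return (AVG, OBP, SLG) from a player's counting stats."""
    singles = h - doubles - triples - hr
    total_bases = singles + 2 * doubles + 3 * triples + 4 * hr
    avg = h / ab                                    # hits per at-bat
    obp = (h + bb + hbp) / (ab + bb + hbp + sf)     # times on base per chance
    slg = total_bases / ab                          # total bases per at-bat
    return avg, obp, slg


# A made-up stat line, not any real player's season.
avg, obp, slg = slash_line(
    ab=500, h=150, doubles=30, triples=5, hr=20, bb=60, hbp=5, sf=5
)
```

Because AVG sits inside both OBP and SLG, the gaps between them (OBP minus AVG, SLG minus AVG) may be more interesting to plot than the raw numbers, though that is only one possible design.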
What data series do you think would go together nicely on a graph?