Rules of Thumb for Complex Visualization
Last time I discussed some simple graphing techniques that make visualizations easier to grasp. This time I want to cover some slightly more advanced techniques for graphing more complicated data.
Basic graphs usually tackle only a few pieces of information:
Bar Graph:
#1: Players
#2: Home Runs
Pie Graph:
#1: Players
#2: Plate Appearance Outcomes
Both of these graphs were simple. The players are constants, so they don't really need much special treatment (they are just labels), so we're really only graphing a single variable for each.
Line Graph:
#1: Players
#2: Home Runs
#3: Over Time
This one is slightly more complicated, but we are all so used to seeing graphs over time that most of us wouldn't find it hard to read. Time varies, but only in a predictable way (we all know when tomorrow will be).
Lots of times we have to deal with more complicated data sets. Instead of just graphing a single unpredictable variable, what if we have to graph two of them?
To give a simple baseball example, let's say you want to see how many wins each team got in 2010, and how much money they paid in salaries.
There are two main ways you can do it. You can give each point its own color and add a legend, or you can use the same color for every point and label the points directly on the chart. Here are examples:
Looks like confetti, doesn't it?
With this much data, there are simply too many points to give each one its own color, so if the point is to show each team, then it's probably better to go with labels. Note that the labels are a lighter color and pretty small -- if I made them black, they would draw attention away from the data, and focus the eyes on the labels -- that would make it harder to see the trends.
So when is color good? Let's say you wanted to make this chart to show the differences between AL and NL teams. That would be great for color:
The benefit of adding color here is that it tells a story. The AL has a much wider gap between the top teams in salary and the bottom teams. Boston and New York are just way ahead of the group, and the top NL teams in payroll are between #2 and #3 in the AL. Clearly there's a gap between the leagues.
You can adjust the colors to tell whatever story you want. Let's say we just want to focus on the AL East and how absurdly unbalanced it is.
I just made the AL East jump out by making them really dark, and all the other teams grey. This tells the story of how incredibly well Toronto and Tampa did considering their financial disadvantage. It also shows the huge gap between the "Haves" and the "Have Nots" in the game's most competitive division.
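If you build your charts in code rather than a spreadsheet, the same trick comes down to picking one of two colors per point. Here is a minimal Python sketch of that idea, not a record of how the chart above was made. The function name `point_colors`, the two hex values, and the short team list are all assumptions chosen for illustration; the resulting list could be handed to whatever plotting tool you use.

```python
# Hypothetical helper: make one group of teams dark and everything else grey.
AL_EAST = {"Orioles", "Red Sox", "Yankees", "Rays", "Blue Jays"}

HIGHLIGHT = "#222222"  # near-black, draws the eye (illustrative value)
MUTED = "#bbbbbb"      # light grey, recedes into the background (illustrative value)


def point_colors(teams, highlighted=AL_EAST, highlight=HIGHLIGHT, muted=MUTED):
    """Return one color per team: dark if the team is in the highlighted set."""
    return [highlight if team in highlighted else muted for team in teams]


# A short illustrative list of teams, in the order the points would be plotted.
teams = ["Red Sox", "Blue Jays", "Mariners", "Rays", "Padres"]
colors = point_colors(teams)
```

Changing the story is then just a matter of passing a different `highlighted` set.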
But let's say we want to make this even more complicated. Let's say we want to not only show data for one year, but show data for 3 years. If we throw all that data on a single graph, we get a giant mess.
Too many labels, no real trending, just a giant blotch of stuff. It becomes pretty hard to tell what is what. So we need to really focus on what we want to say rather than just dumping data on a graph.
For instance, if we want to take a look at AL vs. NL from 2008-2010 in payroll and wins, we can do that. But time doesn't really come into the picture if we do it this way. It just looks like a giant mess of dots, and doesn't help us with trending.
One way we can make things a bit clearer is by changing our colors around. The more recent the year, the more solid we make the color -- that way we can see trending visually over time.
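For readers working in code, one way to get this "older is fainter" effect is to map each year to an opacity value. The sketch below is only an illustration under my own assumptions: the name `year_alphas` and the 0.3 to 1.0 opacity range are made up, and the opacities would still need to be passed to your plotting tool of choice.

```python
def year_alphas(years, min_alpha=0.3, max_alpha=1.0):
    """Map each distinct year to an opacity; the most recent year is the most solid."""
    ordered = sorted(set(years))
    if len(ordered) == 1:
        # Only one year: nothing to fade, so draw it fully solid.
        return {ordered[0]: max_alpha}
    # Spread the opacities evenly between min_alpha and max_alpha.
    step = (max_alpha - min_alpha) / (len(ordered) - 1)
    return {year: round(min_alpha + i * step, 3) for i, year in enumerate(ordered)}


alphas = year_alphas([2008, 2009, 2010])
print(alphas)  # {2008: 0.3, 2009: 0.65, 2010: 1.0}
```

Keeping the minimum well above zero is a judgment call: the oldest year should fade, not vanish.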
For another example, if we just want to see where the AL East teams are moving in the grand scheme of things, we can color them in, connect the dots, and give an idea of how the division is trending.
That shows us a bit more. Baltimore and Toronto are cutting salaries but have improved their win totals in 2010. Tampa Bay is spending more and more each year. The Yankees are essentially treading water. And the Red Sox made a huge jump in salary for 2010.
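Connecting the dots means each team's points have to be grouped together and put in year order before a line is drawn through them. Here is a rough Python sketch of that grouping step. The function name `team_paths` and the row layout are assumptions, and the numbers are placeholders, not real payroll or win totals.

```python
from collections import defaultdict


def team_paths(rows):
    """Group (team, year, payroll, wins) rows into one year-ordered path per team."""
    paths = defaultdict(list)
    # Sorting by team, then year, guarantees each path runs oldest to newest.
    for team, year, payroll, wins in sorted(rows, key=lambda r: (r[0], r[1])):
        paths[team].append((payroll, wins))
    return dict(paths)


# Placeholder rows in arbitrary order; the values are invented for illustration.
rows = [
    ("Team A", 2010, 60, 85),
    ("Team A", 2008, 40, 70),
    ("Team B", 2009, 100, 90),
    ("Team A", 2009, 50, 80),
    ("Team B", 2008, 110, 88),
]
paths = team_paths(rows)
```

Each value in `paths` is a list of (payroll, wins) points that can be drawn as a single connected line, one line per team.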
The other teams are all grey, and they aren't color coded by year. That is a judgment call. Is that information important to your story? In fact, if you only care about the AL East teams, you could even remove all of the excess dots and show just the AL East information. This shows how the division is working internally without cluttering it up with all those extra details.
Another alternative is to make them even less visible so that the AL East data stands out more, but the other information is still out there.
This all comes back to the first rule of graphing: what story are you trying to tell? You should leave in information that is important to your story, and take out anything that doesn't matter. Personally I think that showing Toronto as a more-or-less middle of the pack team, Baltimore as among the perennial losers, and Tampa Bay as the thrifty winners is useful, so I like the last graph best. That tells the story of the AL East best in my mind. But always decide for yourselves.
The basic lesson here is that you can use color to add extra information to a graph, but if you have too much information jumbled together, no amount of color will save you.
Think about your message, make sure that the message is the most obvious thing in the graph, and play around until you find something that works well for you.
If you want to encode more data than this, you need to move into the world of interactive graphing, like Gapminder.org. But that's a story for another day...
References:
- Win Data from Baseball Reference
- 2010 Salary Data from CBS Sports
- 2008-2009 Salary Data from About.com
I'm an expat living in Japan since 2003, doing sales and marketing work. More of my work is available on Henkakyuu, my personal blog. Also feel free to inspire me to use twitter more often @henkakyuu
Comments
"Note that the labels are a lighter color and pretty small -- if I made them black, they would draw attention away from the data, and focus the eyes on the labels -- that would make it harder to see the trends."
Like you said, it depends on the story you’re trying to tell. If you want readers to be able to pick out individual teams and find their spot on the map, finding the label quickly is far more important than emphasizing the value.
One way to handle this is to replace the points with the labels themselves, but this is unacceptable when overlapping occurs, and is undesirable when accuracy is desired.
Great points overall. Good work.
Blogger and Editor, Rational Pastime Blog. Twitter: @RationalPastime.
The best is to make them appear on mouseover
But that is a whole other bag of worms. And depends on the story. I am a macro guy (as in big picture, not VBA though I can do that too), so I prefer people looking at the overall trend first, before hunting down their team.
My Work: Henkakyuu. Entice me to use twitter more @henkakyuu
by jmaciel on Jan 11, 2011 5:20 PM EST via mobile
What program did you use to make these graphs?
"These are thin mints. I put them in the freezer. My favorites. So good."
--Reds outfielder Adam Dunn, on the girl scout cookies he keeps in his locker
Pretty sure this is just excel.
2007 or later, which is how he’s doing the transparency on one of the latter graphs.
See Data Differently: Beyond the Box Score | @justinbopp
Man I gotta get better with excel
"These are thin mints. I put them in the freezer. My favorites. So good."
--Reds outfielder Adam Dunn, on the girl scout cookies he keeps in his locker
Mac Excel 2004
My Work: Henkakyuu. Entice me to use twitter more @henkakyuu
by jmaciel on Jan 11, 2011 5:17 PM EST via mobile
Thanks
"These are thin mints. I put them in the freezer. My favorites. So good."
--Reds outfielder Adam Dunn, on the girl scout cookies he keeps in his locker
One thing I think is continually overlooked
by amateur charteers (what, you want “infographic artist?” fine!) is that almost everything is made entirely too small. MAKE IT BIGGER. Every browser has an autosize option now so there’s no excuse for making squinterriffics.
Make everything bigger. The chart, the datapoints (except where precision is required), the labels, the axis labels, everything. BIGGER. It works.
See Data Differently: Beyond the Box Score | @justinbopp
Yes.
And that way it’ll print! Like the poster I want of your Car-Go graph!
On Twitter: @baseballtwit
Great stuff again, jmaciel.
I should probably just break down and get Excel for the Mac. I do everything in OpenOffice, which produces godawfully ugly graphs. So that’s why I do everything by hand.
So, maybe it’s a GOOD thing that I don’t get Excel. Hmmm…
On Twitter: @baseballtwit
You can make some nice graphs in Excel
And I don’t think you really need to tweak things by hand much once you get the hang of it, but it requires having a pretty in-depth knowledge of the wackiness that is Excel charting. I recommend resources like www.peltiertech.com to get an idea of how to make the charts to begin with, and to use http://colorbrewer2.org/ to get a nice color scheme working for you (my graphs above used the default, which is fine for simple stuff, but when you want to visualize a few variables with gradients, color brewer is invaluable).
My Work: Henkakyuu. Entice me to use twitter more @henkakyuu
Any idea how to get your hands on software that would do the sorts of stuff gapminder does with your own excel data sets?
Animated, more complex, etc.? Thanks for the link to them, btw. Very interesting.
There used to be (still is?) an option on Gapminder to add your own data sets for analysis, but I don’t think that’s exactly what you’re after. You could make it in Excel relatively easily (relatively as in without even using VBA or the like), but you’d be limited by Excel in how many data points you could show on a graph.
My Work: Henkakyuu. Entice me to use twitter more @henkakyuu
