I've written a bit recently about how statistics often trip us up by running counter to our intuition. This is a difficult problem, and if it's one that interests you, I recommend you read some behavioral economics, beginning with Dan Ariely (author of Predictably Irrational). But one of the nice things about online communities is that they allow community auditing. What do I mean by that?
There are plenty of blogs about baseball on the internet, and even a good many that are about baseball and take a rather quantitative approach to analyzing the game. These blogs cross-pollinate ideas and link to one another; in short, this is what we mean when we refer to a blogosphere (a word which, ugly though it is, describes an otherwise unnamed phenomenon). And it seems to me that especially when it comes to baseball statistics, people on the internet get really excited about proving other people wrong (no, I'm not naming any names).
Could this be a good thing?
Some tasks are best executed by a single person. Aphoristically, a room full of monkeys on typewriters might eventually write Shakespeare, but rarely has great literature sprung from the minds of a group of people. When it comes to art, it's a benefit to have a single author with individual personality, desires and idiosyncrasies.
Other tasks are completed fastest, most efficiently, and even most accurately when left to large groups of people. This wasn't always apparent. For example, this TED talk makes the point that, 15 years ago, nobody would have expected that a free, collaborative encyclopedia that did not pay its authors could compete with the Microsoft behemoth. But of course, nobody uses Encarta anymore. Anybody out there not have Wikipedia in his/her browser history? (You're lying.)
So how does this relate to the sabermetrics blogosphere? For the seventh time now, Tom Tango has opened his Scouting Report, By the Fans, For the Fans for balloting. It's an interesting idea. From his introduction:
Baseball's fans are very perceptive. Take a large group of them, and they can pick out the final standings with the best of them. They can forecast the performance of players as well as those guys with rather sophisticated forecasting engines. Bill James, in one of his later Abstracts, had the fans vote in for the ranking of the best to worst players by position. And they did a darn good job.
So he takes this idea of crowdsourcing and applies it to individual defense. All he had to do was create a ballot and a system to tabulate the results, and get as many people to vote as possible. After all, the more ballots, the lower the random error. Perhaps because they are simply aggregated intuition, the results accord fairly well with intuition. For example, the top three defenders on the 2008 Atlanta Braves were Yunel Escobar, Mark Teixeira, and Mark Kotsay.
But wait a minute. Crowds might be pretty good at figuring out some things, but are they really good at evaluating performance? And didn't I say just two days ago that human beings are actually pretty bad at avoiding bias?
Certainly, you would never crowdsource player projections. If you did, you'd probably end up with all kinds of mistakes. Imagine the crowdsourced projection for Brad Lidge's performance this year. Sure, some commentators might have pegged him for regression, but most people probably would have taken 48-48 at face value.
So why then is it a good idea to crowdsource defense? Because of the alternatives. We are getting better at defensive statistics, but the best of them, probably UZR, are what I would call "not terrible." There are some pretty sophisticated projection systems out there for hitting and pitching performance; that simply isn't the case with defense.
So, by working together, fans are able to improve on what exists already. And I'd be willing to bet that the next great defensive statistic will be written about first on the web, open-source, and freely available for all. Until then, go evaluate your favorite team's players.
You knew that Google was also a calculator, right? But did you know it could also help you to calculate statistically significant player slumps? That's what Ian Ayres, writing at Freakonomics, says:
Over his career, A-Rod has averaged one homer for every 14.2 at bats — suggesting there is about a 93 percent chance that he will not homer on any individual at bat. It would be crazy to say that he was in a home-run slump after failing to homer after just a few at bats. But the question is how many homer-less at bats is enough to be a statistically significant drought?
The answer is 42.
(Of course, we already knew the answer was 42.)
But how did he arrive at that figure?
Athlete is having a statistical significant drought if:
Total consecutive number of bad events > log(.05)/log(probability of single bad event)
You can copy and paste the right-hand side of this inequality into Google, plugging in the probability of a single bad event (yes, Google is a calculator):
For A-Rod going homer-less, you would Google: log(.05)/log(.93).
Here .05 is the significance level (corresponding to 95% confidence) and .93 is the probability of A-Rod not hitting a home run in a given at bat (based on his career rate of 14.2 AB/HR). It's a pretty nifty trick.
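If you'd rather not go through Google, the same arithmetic can be sketched in a few lines of Python. This is only an illustration, not Ayres's own code: the function name `drought_threshold` and the `alpha` parameter are made up for the example, and it assumes every at bat is independent with the same chance of a bad event.

```python
import math

def drought_threshold(p_bad: float, alpha: float = 0.05) -> int:
    """Smallest streak length n that satisfies n > log(alpha) / log(p_bad).

    p_bad is the probability of a single bad event (hypothetical name);
    alpha is the significance level, .05 in the quoted example.
    """
    bound = math.log(alpha) / math.log(p_bad)
    # The inequality is strict, so take the next whole number above the bound.
    return math.floor(bound) + 1

# A-Rod going homer-less, using the rounded .93 from the quoted passage.
print(drought_threshold(0.93))  # 42
```

The ratio works out to a little over 41, which is why the first whole number that clears it is 42.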
I would also add that, if you have Excel handy, it's pretty easy to go the other way. For example, if I wanted to know the exact probability of A-Rod going 42 at bats without a home run, I could simply enter the following formula:
=BINOMDIST(42, 42, 0.93, 0)
Here the first 42 is the number of desired outcomes, the second 42 is the number of trials, .93 is again the probability of A-Rod not hitting a home run, and 0 is a flag telling Excel that we are not looking for a cumulative probability (you almost always want this at 0 when doing this kind of calculation). If we punch that into Excel, it gives us back a simple number, in this case .0475, indicating a 4.75% chance.
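For readers without Excel, here is a rough Python equivalent of that non-cumulative binomial calculation. The helper name `binom_pmf` is made up for this sketch; it uses only the standard library and the same .93 figure as above.

```python
import math

def binom_pmf(successes: int, trials: int, p: float) -> float:
    """Probability of exactly `successes` outcomes in `trials` independent tries."""
    return math.comb(trials, successes) * p**successes * (1 - p) ** (trials - successes)

# 42 homer-less at bats out of 42, each with a .93 chance of no home run.
prob = binom_pmf(42, 42, 0.93)
print(round(prob, 4))  # roughly 0.0475
```

Because all 42 trials have to come up the same way, this collapses to .93 raised to the 42nd power.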
Between Google and Excel, there isn't a whole lot of math that can elude you.
Speaking of math that eludes me, here's an interesting thread I've picked up. It started back in March, with Larry at wezen-ball.com. He wondered if two baseball games had ever played out identically:
I did this by looking at every game in the database and finding any games that had identical end-game statistics to it. If two games had the same number of innings played and identical home- and road- runs, hits, errors, and men left-on-base, I marked them as a unique pair. There were 3,479 such pairs of games.
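To make the matching rule concrete, here is a toy sketch of how such pairs could be found. It is not Larry's code and does not touch the real Retrosheet data: the three records and their field layout are invented purely for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records: (game_id, innings, home (R, H, E, LOB), road (R, H, E, LOB))
games = [
    ("g1", 9, (3, 7, 0, 6), (2, 5, 1, 4)),
    ("g2", 9, (3, 7, 0, 6), (2, 5, 1, 4)),
    ("g3", 9, (5, 9, 1, 8), (4, 8, 0, 7)),
]

# Group games whose end-game statistics are identical.
by_line = defaultdict(list)
for game_id, innings, home, road in games:
    by_line[(innings, home, road)].append(game_id)

# Every two games that share a group form one matching pair.
pairs = [pair for ids in by_line.values() for pair in combinations(ids, 2)]
print(pairs)  # [('g1', 'g2')]
```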
You can read his full method, along with a list of the closest games in the Retrosheet era, at the link. But recently, someone with a mathematics blog (God Plays Dice) came across his post and wondered if perhaps Larry's criteria were overly strict:
"[I]dentically" is defined a bit too strictly; (say) a groundout to second and a groundout to shortstop are counted as different. And the metric that the author uses for similarity of two games A and B is, I think, the number of times where the nth plate appearance in games A and B had the same outcome. Intuitively I think you'd want to line up innings with each other. Two "most similar" games should at least have similar-looking line scores. I think what one wants is some notion of "edit distance" between games, and defining that is hardly trivial.
Hardly trivial indeed. This is baseball we're talking about!
OK, enough joking around; it's time to bump up against the limits of my mathematical knowledge. I think what we want here is a specific kind of edit distance, which in computer science is a way of describing how many changes you would have to make to one string in order to transform it into another. For example, the edit distance between "one" and "two" is three, because you have to change each letter. There are different kinds of edit distance, depending on what counts as a fair move (can you transpose?).
For a baseball game, I don't think we especially care whether the groundout came before the single or vice versa, as long as the outcomes were similar. So it would seem most appropriate to use something called Damerau-Levenshtein distance, which is described by Wikipedia (take that, Encarta) as:
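As a sanity check on the "one" and "two" example, here is a minimal sketch of the plainest kind of edit distance (Levenshtein, which allows insertions, deletions, and substitutions but no transpositions). The function name is my own; this is the textbook dynamic-programming approach, kept short rather than fast.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,         # delete char_a
                curr[j - 1] + 1,     # insert char_b
                prev[j - 1] + cost,  # substitute (free if the characters match)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("one", "two"))  # 3
```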
Damerau–Levenshtein distance is a "distance" (string metric) between two strings, i.e., finite sequence of symbols, given by counting the minimum number of operations needed to transform one string into the other, where an operation is defined as an insertion, deletion, or substitution of a single character, or a transposition of two characters.
I think if we could calculate a Damerau-Levenshtein distance for all sufficiently similar games and rank them in ascending order, we'd have a pretty good answer to the question of which two baseball games were most similar.
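To give the comp. sci. folks something to shoot at, here is a rough sketch of that ranking idea. Several things in it are assumptions on my part: each game is boiled down to a list of made-up event codes, the three games are invented, and the distance is the simpler "optimal string alignment" variant (sometimes called restricted Damerau-Levenshtein), which only swaps adjacent events and never edits the same stretch twice. Whether that is the right metric for baseball games is exactly the open question.

```python
from itertools import combinations

def osa_distance(a, b):
    """Optimal string alignment distance between two sequences.

    Counts insertions, deletions, substitutions, and swaps of adjacent items.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            # Adjacent transposition, e.g. groundout-single vs. single-groundout.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

# Hypothetical event codes: G groundout, F flyout, K strikeout, 1B single, BB walk.
games = {
    "game_a": ["G", "1B", "K", "F"],
    "game_b": ["1B", "G", "K", "F"],
    "game_c": ["K", "K", "BB", "F"],
}

# Score every pair of games and list the most similar pairs first.
ranked = sorted(
    (osa_distance(games[x], games[y]), x, y) for x, y in combinations(games, 2)
)
for dist, x, y in ranked:
    print(dist, x, y)
```

Comparing every pair this way gets expensive quickly across a full database of games, which is one more reason to restrict it to games that are already "sufficiently similar."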
Comp. sci. geeks: now is your time to shine. Can we make this happen? Am I even right about which edit distance would be most appropriate? Is this a fool's errand? Please, I'm in far over my head, and I need a little help from my friends.
And no, I'm not going to let you get away without linking to Joe Cocker.