Every so often your compere, Marc Normandin, posts a conundrum to his writing staff. And would you believe it, one popped into my inbox just the other day. His question: how likely is it for a hitter to post a batting average of .400 in a season? The last time the .400 mark was breached was by Ted Williams in 1941. He became the 35th player to join the exclusive club. That we are still waiting for number 36 should give an indication of how rare a feat it is in the modern game.
Batting average (AVG) is a much-maligned stat in sabermetric circles. The problem with it is that it correlates less well with runs and wins than other on-base statistics such as OBA and OPS; not much less well, mind you, but still less. Actually I am an AVG fan, if for no other reason than nostalgia: it was probably the first baseball stat I learnt. At its root it measures how effective a batter is at making contact with the ball and getting on base. Sure, it doesn't distinguish between those who ding every ball over the 400 marker and those who squeeze it between 2nd and short, or how successful a batter is at discriminating between a ball and a strike, but other stats cater for that. As long as the stat is used in its right context, what is the problem?
Hey, did I tell you that the last time someone hit over .400 was 1941? That is over 60 years ago. In recent times Tony Gwynn got mighty close in 1994 with a .394 average but few have threatened the magical .400 barrier. Here is a list of the top 20 averages since 1961:
Looking at this table gives us a first clue as to an important determinant of batting average. Thirteen of the twenty highest averages have come in the last decade or so. Is this because the quality of hitting is getting better? Such sweeping statements are difficult to substantiate; the only conclusion that we can draw with certainty is that this spike in AVG coincided with a rise in the run environment:
Distinguishing cause and effect is difficult: who knows if batters really have got better, or if the explosion of below-average relievers has led to more hits, or if steroids have been to blame, or if the umps have been ceding a more generous strike zone, or if aliens have somehow altered the space-time continuum to create a weird electromagnetic attraction between ball and bat. But whatever the real reason, the more runs per game, the higher the number of hits. And since the number of players on a team hasn't increased (it was still 9 last time I checked!) then, unless walk rates have exploded (they have not), AVG must have gone up. Indeed, tracking this over the same 45-year period shows this to be so.
You'd imagine that the closer the average AVG is to .400, then the more likely it is for an individual batter to achieve such a mark. This is what the data above confirm.
Can we put a bit more analytical rigor around just how likely it is for a batter to rack up a .400 AVG? Absolutely. Batting average is a binomial statistic. Every time a hitter comes to the plate there are two outcomes (that impact AVG): hit or no hit. This means that we can apply a binomial distribution to work out the amount of spread in the data as a result of random variation. Using just two pieces of information, the probability of getting a hit and the number of trials, we can work out the odds that a league-average player will hit .400. The probability of getting a hit is simply AVG, and the number of trials is at-bats. The average AVG in 2005 was .264 and the minimum at-bats to qualify for the batting title is 400, so plugging in the numbers we find that the expected standard deviation is 0.019. This means that roughly 68% of players would have an average between .242 and .286. If we extend the range we are considering to three standard deviations then the binomial distribution says that about 99.7% of players will hit between .198 and .329! That is still a long way from .400.
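For anyone who wants to check the arithmetic, here is a minimal Python sketch of the binomial spread calculation. The function name `binomial_avg_sd` is my own invention, and the one- and three-standard-deviation bands lean on the normal approximation to the binomial, so treat it as an illustration rather than the exact working. One thing to watch: with exactly 400 at-bats the formula gives roughly 0.022, which is the spread the ranges quoted above imply; a value of 0.019 corresponds to a larger at-bat total of around 540, which is an inference on my part rather than a stated input.

```python
import math


def binomial_avg_sd(p, at_bats):
    """Standard deviation of batting average from luck alone.

    p is the true probability of a hit per at-bat; at_bats is the
    number of trials. (Hypothetical helper name, for illustration.)
    """
    return math.sqrt(p * (1 - p) / at_bats)


league_avg = 0.264  # 2005 league AVG quoted in the text
at_bats = 400       # qualifying threshold used in the text

sd = binomial_avg_sd(league_avg, at_bats)
one_sd = (league_avg - sd, league_avg + sd)
three_sd = (league_avg - 3 * sd, league_avg + 3 * sd)

print(f"SD at {at_bats} AB: {sd:.3f}")
print(f"1 SD band: {one_sd[0]:.3f} to {one_sd[1]:.3f}")
print(f"3 SD band: {three_sd[0]:.3f} to {three_sd[1]:.3f}")

# Assumption: a fuller season's workload, to show how the spread shrinks.
print(f"SD at 540 AB: {binomial_avg_sd(league_avg, 540):.3f}")
```

The more at-bats, the narrower the spread, which is why small samples throw up gaudy averages far more often than full seasons do.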
Hang on a cotton picking minute. Many players have beaten the .329 mark, and modern greats like Pujols and Bonds have topped it with unerring regularity. We know that they are better than .270 league-average batters. So we also need to factor batter skill into the equation. How can we do this? If we compare the expected distribution of AVG with the actual distribution, then the difference is directly attributable to batter skill. For the 2005 season we worked out that the expected standard deviation is 0.019 (assuming that all batters are of equal skill). Calculating the actual standard deviation is a little trickier as batters have different numbers of at-bats. We can overcome this by weighting each batter's AVG by the inverse of at-bats. This gives us a standard deviation of 0.022. We know from binomial theory that the random variation and the skill variation are independent, so their variances add. By subtracting the expected variance from the actual variance (i.e., the squares of the standard deviations) and taking the square root we can work out that skill accounts for 0.011 points of AVG.
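The subtraction is easy to reproduce. Below is a small Python sketch, assuming (as the text does) that luck and skill are independent so that their variances add; the function name `skill_sd` is mine and purely illustrative.

```python
import math


def skill_sd(observed_sd, random_sd):
    """Back out the spread due to skill from the observed spread.

    Assumes observed variance = random variance + skill variance.
    (Hypothetical helper, not taken from any library.)
    """
    skill_var = observed_sd ** 2 - random_sd ** 2
    if skill_var < 0:
        raise ValueError("observed spread is smaller than the random spread")
    return math.sqrt(skill_var)


# Figures quoted in the text for the 2005 season.
observed = 0.022  # actual standard deviation of AVG
expected = 0.019  # standard deviation from luck alone

print(round(skill_sd(observed, expected), 3))  # 0.011
```

Note that it is the variances that subtract, not the standard deviations: 0.022 minus 0.019 would give 0.003, which badly understates the role of skill.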
So what does this mean? Dividing the expected population variance by the skill variance tells us how heavily to regress to the mean: it takes about 1300 at-bats before we regress only 50% towards the mean, or just over three seasons' worth (at 400 at-bats per season). At first glance this seems a little surprising. To check if we are right we can look at the year-to-year correlation:
We get an r^2 of 0.09, which indicates that AVG isn't as repeatable a skill as first thought. What we forget is that there are a select number of batters who are just exceptionally good. In fact if you check out some 50% PECOTA rankings, Pujols aside, only 10 players register a weighted mean projected AVG of over .300 for 2006. In reality we know that many more will probably clear this hurdle.
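To make the 1300 at-bat figure a little more concrete, here is a rough Python sketch of regressing an observed average towards the league mean. It takes the 1300 at-bat halfway point and the .264 league average at face value, and it uses a common shrinkage rule (weight on the observed average equals at-bats divided by at-bats plus the halfway point), which is my assumption rather than something derived above. The function name `regress_to_mean` is hypothetical.

```python
def regress_to_mean(observed_avg, at_bats, league_avg=0.264, half_point=1300):
    """Shrink an observed AVG towards the league mean.

    half_point is the number of at-bats at which observed and league
    averages get equal weight (1300 is the figure quoted in the text).
    """
    weight = at_bats / (at_bats + half_point)  # trust placed in the sample
    return league_avg + weight * (observed_avg - league_avg)


# A hypothetical .350 hitter after one 400 at-bat season...
print(f"{regress_to_mean(0.350, 400):.3f}")
# ...and after 1300 at-bats, where we regress exactly halfway.
print(f"{regress_to_mean(0.350, 1300):.3f}")
```

Under these assumptions a single season barely moves the estimate off the league mean, which fits with the weak year-to-year correlation.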
OK, that is enough debating the merits of batting average as a predictive statistic. Let's use our knowledge of the binomial distribution to work out how likely it is for batters with different skill levels to hit .400. Again this is done using our new best friend, the binomial distribution.
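As a rough sketch of the kind of calculation involved, the Python below sums the exact binomial tail to get the chance that a batter of a given true skill level collects enough hits for a .400 average in a 400 at-bat season, then converts that into the number of seasons needed for an even chance of at least one such year. The function names and the skill levels in the loop are my own illustrative choices, and every season is assumed to be independent with the same true skill, which is a simplification.

```python
import math


def prob_hit_400(skill, at_bats=400):
    """Chance of finishing at .400 or better given true hit probability."""
    hits_needed = -(-2 * at_bats // 5)  # ceil(0.4 * at_bats) in integers
    return sum(
        math.comb(at_bats, k) * skill ** k * (1 - skill) ** (at_bats - k)
        for k in range(hits_needed, at_bats + 1)
    )


def seasons_for_even_odds(p_season):
    """Seasons needed for a 50% chance of at least one success."""
    return math.log(0.5) / math.log(1 - p_season)


# Illustrative skill levels (assumed, for demonstration only).
for skill in (0.300, 0.325, 0.350, 0.375):
    p = prob_hit_400(skill)
    print(f"skill {skill:.3f}: P(.400) = {p:.5f}, "
          f"seasons for 50% = {seasons_for_even_odds(p):.0f}")
```

The probability falls away very quickly as the skill level drops below .350, so the number of seasons required climbs from merely implausible to absurd.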
Another way of interpreting the table is to say that a batter with a .350 skill level would have to play for an eye-watering 35 seasons before the likelihood of him hitting .400 is 50%. Yikes. This is supported by PECOTA. Take a look at the top 5 hitters in the game from last year. Below are their weighted mean and 90% PECOTA projections.
Again no one gets anywhere near the .400 barrier. Not even Pujols' and Bonds' 90% projections come close, and these are probably two of the greatest contemporary hitters in the game. Face it. The Royals are more likely to win the World Series than we are to see a .400 season. Next time you see someone like Tony Gwynn hit .394, stop, admire, and appreciate that you have witnessed something that is just really special.