Baseball fans, both casual and intense, are generally most interested in the best or worst players. The merely average does not seem to interest us. You'll notice that more breath, ink, and bandwidth are spent discussing Mike Trout (.321/.412/.573, 174 wRC+ in 2012-2014) vs. Miguel Cabrera (.331/.405/.600, 170) or Dan Uggla vs. Historical Lows than James Loney (.279/.329/.386, 100) vs. Denard Span (.280/.333/.389, 101).
Despite this interest in extrema, we generally don't spend much time on extrema in sabermetrics, or at least I haven't. It's more common to look at the league average of (fill in the statistic) to try to predict the direction of the league as a whole, whether that means identifying the next offensive explosion or the next Year of the Pitcher.
In my next few articles, I'm going to briefly discuss the modeling of extremes and apply this thought process to a number of questions that could interest both sabermetricians and more casual fans. There will be some math along the way, but if you only care about the answer to the question at hand, you can skip ahead to the end. But first, we need to start with a little bit of statistical theory...
The Statistics of Extrema
Warning: Gory Math ahead. Feel free to skip this section if you are so inclined.
The statistical theory of extrema (minimums and maximums) is a fascinating business. Theory for averages is very well established, and is commonplace to the point that I've taught it in undergrad statistical methods courses.
Extrema require a little more thought and work. To begin, we need some notation. Assume that our data points X_i are an independent sample from some distribution with c.d.f. F(x) and p.d.f. f(x). Also, say that the sample can be ordered from smallest to largest as X(1), ..., X(n).
In general, we want to look at the distributions of the minimum and maximum. Knowing the distribution gives us the expected values, variances, and probabilities that allow us to evaluate the extrema. You can get a general idea of these values through bootstrapping, but this is unnecessary, as the general form of these distributions has already been worked out. I will spare you the details, but the distributions of the minimum X(1) and maximum X(n) are given below.
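For reference, under the independence assumption above, the standard order-statistic results for a sample of size n can be written as follows (LaTeX, using the same F, f, and X(k) notation; the subscripted names on the left are just labels for the c.d.f. and p.d.f. of each extreme):

```latex
\begin{align*}
F_{X_{(1)}}(x) &= 1 - \bigl[1 - F(x)\bigr]^{n}, &
f_{X_{(1)}}(x) &= n\,\bigl[1 - F(x)\bigr]^{n-1} f(x), \\
F_{X_{(n)}}(x) &= \bigl[F(x)\bigr]^{n}, &
f_{X_{(n)}}(x) &= n\,\bigl[F(x)\bigr]^{n-1} f(x).
\end{align*}
```

The maximum is at most x only if all n observations are at most x, which is where the power of n comes from; the minimum works the same way with "at least" in place of "at most."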
From this point, this knowledge can be applied to modeling any extrema---in fact any order statistic---as long as one is willing to make distributional assumptions on the data. Over the next few articles, I'll be applying this method to a variety of stats and questions, beginning with...
The .400 Season
In a game where, according to Ted Williams, succeeding 3 out of 10 times allows you to be considered a good performer, it's understandable why the .400 season would be so revered. Since 1901, 8 different players have compiled 13 total seasons in which they reached the "batter's perfection." Every name is well-known, but let's list their accomplishments for a moment.
|Shoeless Joe Jackson||1911||0.4081|
There have been many notable runs at the fabled number, including George Brett in 1980 and Tony Gwynn in 1994. But every attempt in the nearly 75 years since Ted Williams has fallen short. The general thought today is that the .400 season is gone forever. To quote Beyond the Box Score's own Anthony Joshi-Pawlowic,
[The .400 season seems] pretty lofty given the direction the game is going in. With the progression towards more shifts, specializations, platoons, and Sabermetrics in general; I'd imagine either would be a long shot...
The Chances of a .400 Season
But how gone is the .400 season? Is it really gone forever? To look at that, we need to go into the distribution of the league-leading batting average for each year. In this case, we'll only be looking at the batting averages of qualified players, for obvious reasons. No one talks about Bob Hazle's 1957 .403 batting average alongside Cobb and Hornsby...because it happened in 155 PAs.
Again, to look at the distributions we're interested in, we need to make a few distributional assumptions. As batting average is theoretically bounded on [0,1], a Beta distribution is an appropriate choice. However, for each year, the two parameters of the distribution need to be estimated. This was done through a method-of-moments type of estimation: matching the sample weighted mean and sample weighted variance to the expected value and variance of the Beta distribution. It's worth noting that qualified players tend to be better than average as well as less variable.
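As a rough sketch of what that moment matching can look like, here is a small Python function. The function name `fit_beta_moments` is mine, and weighting by at-bats is an assumption on my part (the weights could be any nonnegative numbers); the sample numbers at the bottom are made up purely for illustration.

```python
def fit_beta_moments(averages, weights):
    """Method-of-moments fit of a Beta(alpha, beta) distribution.

    averages: batting averages of the qualified players in one season
    weights:  one nonnegative weight per player (assumed here: at-bats)
    """
    total_w = sum(weights)
    # Weighted sample mean and (population-style) weighted variance.
    mean = sum(w * x for x, w in zip(averages, weights)) / total_w
    var = sum(w * (x - mean) ** 2 for x, w in zip(averages, weights)) / total_w
    if var <= 0 or var >= mean * (1 - mean):
        raise ValueError("moments are not compatible with a Beta distribution")
    # For Beta(a, b): mean = a / (a + b), var = mean * (1 - mean) / (a + b + 1).
    # Solving those two equations gives a + b = mean * (1 - mean) / var - 1.
    common = mean * (1 - mean) / var - 1
    return mean * common, (1 - mean) * common


# Illustrative, made-up inputs (not real player data).
alpha, beta = fit_beta_moments([0.250, 0.300], [500, 500])
print(alpha, beta)
```

A Beta distribution fitted this way reproduces the weighted mean and variance exactly, which is all the method of moments promises; it says nothing about how well the tails fit.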
Once these parameters are estimated, the distribution of maximum average for a sample of n players, where n is the number of qualifying players in a given year, can be determined. From there, the probability that that maximum exceeds .400 can be easily determined. So, let's look at that probability through the years.
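To make that step concrete, here is a minimal sketch, assuming the n qualified averages behave like independent draws from the fitted Beta distribution with both parameters greater than 1. The function names are mine, and the upper tail is integrated numerically with Simpson's rule so the block needs only the standard library; a library routine such as `scipy.stats.beta.sf` could stand in for `beta_sf`.

```python
import math


def beta_pdf(x, a, b):
    """Beta(a, b) density at x (assumes a > 1 and b > 1, so it is 0 at the ends)."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))


def beta_sf(x, a, b, steps=2000):
    """P(X > x) for X ~ Beta(a, b), by composite Simpson's rule on [x, 1]."""
    if x <= 0.0:
        return 1.0
    if x >= 1.0:
        return 0.0
    h = (1.0 - x) / steps  # steps must be even for Simpson's rule
    total = beta_pdf(x, a, b) + beta_pdf(1.0, a, b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * beta_pdf(x + i * h, a, b)
    return total * h / 3


def prob_max_exceeds(threshold, a, b, n):
    """P(max of n i.i.d. Beta(a, b) draws > threshold) = 1 - F(threshold)^n."""
    tail = beta_sf(threshold, a, b)
    if tail >= 1.0:
        return 1.0
    # 1 - (1 - tail)^n, written with log1p/expm1 to stay accurate for tiny tails.
    return -math.expm1(n * math.log1p(-tail))


# Hypothetical parameters and player count, for illustration only.
print(prob_max_exceeds(0.400, 87.45, 230.55, 150))
```

Working from the upper tail rather than from F directly is a design choice: the probabilities of interest here are small, and subtracting a number very close to 1 from 1 throws away precision.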
The black points are the probabilities in seasons where a batter in fact reached .400. As can be seen, Ted Williams's 1941 season had by far the lowest probability of occurring. So where are we today? In 2013, given the distribution of batting averages for qualified players, the probability of the highest batting average exceeding .400 was 0.0010. In 2014 so far, that probability has risen to 0.0036.
Another way to look at this is to find the expected maximum average in that year. This will strongly correspond to these probabilities. Again, the black points represent years with a .400 average.
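One way to get at the expected maximum, sketched below under the same independence and Beta assumptions (with both parameters greater than 1), uses the fact that for a quantity bounded on [0,1] the expected value equals the integral of 1 - F(x)^n from 0 to 1. The function names and the grid size are my own choices, and the integration is a plain trapezoid rule rather than anything clever.

```python
import math


def beta_pdf(x, a, b):
    """Beta(a, b) density at x (assumes a > 1 and b > 1, so it is 0 at the ends)."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))


def expected_max(a, b, n, steps=20000):
    """E[max of n i.i.d. Beta(a, b) draws] = integral over [0, 1] of 1 - F(x)^n."""
    h = 1.0 / steps
    cdf = 0.0                      # running value of F(x), built up on the grid
    prev_pdf = beta_pdf(0.0, a, b)
    prev_g = 1.0                   # integrand 1 - F(0)^n at the left endpoint
    total = 0.0
    for i in range(1, steps + 1):
        pdf = beta_pdf(i * h, a, b)
        cdf = min(cdf + 0.5 * (prev_pdf + pdf) * h, 1.0)  # trapezoid step for F
        g = 1.0 - cdf ** n
        total += 0.5 * (prev_g + g) * h                   # trapezoid step for E[max]
        prev_pdf, prev_g = pdf, g
    return total


# Hypothetical parameters and player count, for illustration only.
print(expected_max(87.45, 230.55, 150))
```

With n = 1 this reduces to the ordinary mean of the distribution, which makes for a quick sanity check.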
The Best Batting Average Seasons
Finally, as we have a distribution that allows us to calculate the expected value and variance of the leading average, it is possible to look for the best league-leading seasons in baseball history. There are many ways to go about this, but the easiest method is to calculate the probability that the maximum batting average would be greater than the one actually observed. With that in mind, the most (and least) impressive league-leading seasons are...
So yeah, it seems the .400 hitter is gone. Like, real gone. It's not technically impossible, but if the future probability that the leading hitter reaches .400 is the same as the average yearly probability from 1942-2014 (0.0033), our grandchildren's grandchildren's grandchildren's grandchildren could be expected to come and go without seeing one.
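To put a rough number on that wait, here is a back-of-the-envelope sketch. It assumes every future season is independent and carries the same 0.0033 chance, which is a simplification; the function names and the 100-year horizon are mine and purely illustrative.

```python
def expected_wait_years(p):
    """Mean number of seasons until the first success when each season succeeds with probability p."""
    return 1.0 / p


def prob_at_least_one(p, years):
    """Chance of at least one success in `years` independent seasons."""
    return 1.0 - (1.0 - p) ** years


p = 0.0033  # average yearly probability for 1942-2014, from the text above
print(expected_wait_years(p))     # roughly 303 seasons
print(prob_at_least_one(p, 100))  # chance of seeing one within a 100-year span
```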
. . .
Data courtesy of FanGraphs.