Sabermetric Extrema: The .400 Season

The .400 season has long gone the way of the dodo, but every few years, there seems to be a run at the fabled number. But what realistically are its chances from year to year? By looking at the distributions of extrema in statistics, we can get a pretty decent idea.

Ted Williams, the last .400 hitter
Andrew Weber-US PRESSWIRE

Baseball fans, both casual and intense, are generally most interested in the best or worst players. The merely average does not seem to interest us. You'll notice that more breath, ink, and bandwidth are spent discussing Mike Trout (.321/.412/.573, 174 wRC+ in 2012-2014) vs. Miguel Cabrera (.331/.405/.600, 170) or Dan Uggla vs. Historical Lows than James Loney (.279/.329/.386, 100) vs. Denard Span (.280/.333/.389, 101).

Despite this interest in extrema, we generally don't dig too deeply into them in sabermetrics, or at least I haven't. It's more common to look at where the league-average (fill in the statistic) is heading to try to predict the direction of the league as a whole, whether that means identifying the next offensive explosion or the next Year of the Pitcher.

In my next few articles, I'm going to briefly discuss the modeling of extremes and apply this thought process to a number of questions that could interest both sabermetricians and more casual fans. There will be some math along the way, but if you are only interested in the answer to the question at hand, you can skip to the end. But first, we need to start with a little bit of statistical theory...

The Statistics of Extrema

Warning: Gory Math ahead. Feel free to skip this section if you are so inclined.

The statistical theory of extrema (minimums and maximums) is a fascinating business. Theory for averages is very well established, and is commonplace to the point that I've taught it in undergrad statistical methods courses.

Extrema require a little more thought and work. To begin, we need some notation. Assume that our data points X1, ..., Xn are independent draws from some distribution with c.d.f. F(x) and p.d.f. f(x). Also, say that the sample can be ordered from smallest to largest as X(1), ..., X(n), so that X(1) is the minimum and X(n) is the maximum.

Now, in general, we want to look at the distributions of the minimum and maximum. Knowing the distribution gives us the expected values, variances, and probabilities we need to evaluate the extrema. You can get a general idea of these values through bootstrapping, but this is unnecessary, as the general form of these distributions has already been worked out. I will spare you the details, but the distributions of the minimum X(1) and maximum X(n) are given below.

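Distribution of the maximum:

$$f_{X_{(n)}}(x) = n \, f(x) \, [F(x)]^{n-1}$$

Distribution of the minimum:

$$f_{X_{(1)}}(x) = n \, f(x) \, [1 - F(x)]^{n-1}$$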
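For anyone who would rather see the theory checked than take it on faith, here is a minimal sketch. It leans on the fact that the largest of n independent draws falls at or below x only when every draw does, so its c.d.f. is F(x)^n, while the smallest falls at or below x with probability 1 - (1 - F(x))^n. The Uniform(0, 1) distribution, the sample size, and the function names are my own illustrative assumptions, not anything used later in the article.

```python
import random

def max_cdf(x, n):
    """C.d.f. of the maximum of n i.i.d. Uniform(0, 1) draws: F(x)**n with F(x) = x."""
    return x ** n

def min_cdf(x, n):
    """C.d.f. of the minimum of n i.i.d. Uniform(0, 1) draws: 1 - (1 - F(x))**n."""
    return 1.0 - (1.0 - x) ** n

def simulate_extrema(n, trials, seed=0):
    """Draw `trials` samples of size n and return the observed minima and maxima."""
    rng = random.Random(seed)
    minima, maxima = [], []
    for _ in range(trials):
        sample = [rng.random() for _ in range(n)]
        minima.append(min(sample))
        maxima.append(max(sample))
    return minima, maxima

def empirical_cdf(values, x):
    """Fraction of simulated values at or below x."""
    return sum(v <= x for v in values) / len(values)

# Illustrative settings only: 10 draws per sample, 20,000 simulated samples.
n, trials = 10, 20000
minima, maxima = simulate_extrema(n, trials)
print("P(max <= 0.9): simulated", empirical_cdf(maxima, 0.9), "formula", max_cdf(0.9, n))
print("P(min <= 0.1): simulated", empirical_cdf(minima, 0.1), "formula", min_cdf(0.1, n))
```

The simulated and closed-form values should agree to within simulation noise, which is the sense in which resampling is unnecessary once the formula is known.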

From this point, this knowledge can be applied to modeling any extrema---in fact any order statistic---as long as one is willing to make distributional assumptions on the data. Over the next few articles, I'll be applying this method to a variety of stats and questions, beginning with...

The .400 Season

In a game where, according to Ted Williams, succeeding 3 out of 10 times allows you to be considered a good performer, it's understandable why the .400 season would be so revered. Since 1901, 8 different players have compiled 13 total seasons where they reached the "batter's perfection." Every name is well-known, but let's list their accomplishments for a moment.

Name Year AVG
Nap Lajoie 1901 0.4265
Shoeless Joe Jackson 1911 0.4081
Ty Cobb 1911 0.4196
Ty Cobb 1912 0.4087
George Sisler 1920 0.4073
George Sisler 1922 0.4198
Rogers Hornsby 1922 0.40128
Ty Cobb 1922 0.4011
Harry Heilmann 1923 0.4027
Rogers Hornsby 1924 0.4235
Rogers Hornsby 1925 0.4028
Bill Terry 1930 0.40126
Ted Williams 1941 0.4057

There have been many notable runs at the fabled number, including George Brett in 1980 and Tony Gwynn in 1994. But every attempt in the nearly 75 years since Ted Williams has fallen short. The general thought today is that the .400 season is gone forever. To quote Beyond the Box Score's own Anthony Joshi-Pawlowic,

[The .400 season seems] pretty lofty given the direction the game is going in. With the progression towards more shifts, specializations, platoons, and Sabermetrics in general; I'd imagine either would be a long shot...

The Chances of a .400 Season

But how gone is the .400 season? Is it really gone forever in its entirety? To look at that, we need to go into the distributions for the league leading batting average for each year. In this case, we'll only be looking at the batting averages of qualified players for obvious reasons. No one talks about Bob Hazle's 1957 .403 batting average alongside Cobb and Hornsby...because it happened in 155 PAs.

Again, to look at the distributions we're interested in, we need to make a few distributional assumptions. As batting average is theoretically bounded on [0, 1], a Beta distribution is an appropriate choice. However, the two parameters of the distribution need to be estimated for each year. This was done through a method-of-moments type of estimation: matching the weighted sample mean and weighted sample variance to the expected value and variance of the Beta distribution. It's worth noting that qualified players tend to be better than average as well as less variable.
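A method-of-moments fit of this kind can be sketched in a few lines. For a Beta(a, b) distribution the mean is a/(a+b) and the variance is ab/((a+b)^2 (a+b+1)), and solving those two equations for a and b gives the formulas in the code. The sample numbers are made up, and weighting by at-bats is an assumption on my part, since the text above does not say which weights were used.

```python
def weighted_moments(values, weights):
    """Weighted mean and weighted (population-style) variance."""
    total = sum(weights)
    mean = sum(w * v for v, w in zip(values, weights)) / total
    var = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    return mean, var

def beta_method_of_moments(mean, var):
    """Solve mean = a/(a+b), var = ab/((a+b)^2 (a+b+1)) for (a, b)."""
    if not 0.0 < mean < 1.0 or not 0.0 < var < mean * (1.0 - mean):
        raise ValueError("moments are not attainable by a Beta distribution")
    common = mean * (1.0 - mean) / var - 1.0  # this quantity equals a + b
    return mean * common, (1.0 - mean) * common

# Hypothetical batting averages and at-bat totals, for illustration only.
averages = [0.251, 0.268, 0.274, 0.289, 0.301, 0.318]
at_bats = [512, 540, 498, 601, 577, 560]

mean, var = weighted_moments(averages, at_bats)
alpha, beta = beta_method_of_moments(mean, var)
print(f"weighted mean={mean:.4f}, variance={var:.6f}, alpha={alpha:.1f}, beta={beta:.1f}")
```

A tight spread of averages produces large values of both parameters, which is what you would expect from a group of qualified hitters.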

Once these parameters are estimated, the distribution of maximum average for a sample of n players, where n is the number of qualifying players in a given year, can be determined. From there, the probability that that maximum exceeds .400 can be easily determined. So, let's look at that probability through the years.
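As a rough sketch of that calculation: if each of n qualifiers is treated as an independent Beta(a, b) draw, the maximum stays at or below .400 with probability F(.400)^n, so the chance of a .400 season is 1 - F(.400)^n. The code below uses only the standard library, getting the Beta tail by Simpson's rule rather than from a statistics package. The parameter values and the count of 150 qualifiers are hypothetical stand-ins, not the fitted values behind the plot.

```python
import math

def beta_pdf(x, a, b):
    """Beta(a, b) density at x, computed on the log scale (assumes a, b > 1)."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1.0) * math.log(x) + (b - 1.0) * math.log1p(-x))

def beta_upper_tail(x, a, b, steps=2000):
    """P(X > x) for X ~ Beta(a, b), by composite Simpson's rule on [x, 1]."""
    h = (1.0 - x) / steps  # steps must be even for Simpson's rule
    total = beta_pdf(x, a, b) + beta_pdf(1.0, a, b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * beta_pdf(x + i * h, a, b)
    return min(max(total * h / 3.0, 0.0), 1.0)

def prob_max_exceeds(threshold, a, b, n):
    """P(largest of n i.i.d. Beta(a, b) draws > threshold) = 1 - F(threshold)**n."""
    tail = beta_upper_tail(threshold, a, b)
    if tail >= 1.0:
        return 1.0
    # expm1/log1p keep precision when the single-player tail is tiny.
    return -math.expm1(n * math.log1p(-tail))

# Hypothetical inputs: roughly a .270 mean with a .027 standard deviation.
alpha, beta, n_qualified = 72.7, 196.6, 150
print(f"P(max > .400) = {prob_max_exceeds(0.400, alpha, beta, n_qualified):.6f}")
```

Working from the upper tail instead of subtracting F(.400)^n from 1 is a deliberate choice: the quantity of interest is tiny, and this route avoids losing it to rounding.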

P400_medium

The black points are the probabilities in seasons where a batter in fact reached .400. As can be seen, Ted Williams's 1941 season had by far the lowest probability of occurring of that group. So where are we today? In 2013, with the distribution of batting averages for qualified players as is, the probability of the highest batting average exceeding .400 in that environment is 0.0010. In 2014 so far, that probability is up to 0.0036.

Another way to look at this is to find the expected maximum average in that year. This will strongly correspond to these probabilities. Again, the black points represent years with a .400 average.
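The expected maximum can be sketched the same way. For a quantity that lives on [0, 1], the expected value of the maximum equals the integral of 1 - F(x)^n over that interval, so all that is needed is the Beta c.d.f. on a grid. This is a numerical approximation under the same i.i.d. Beta assumption, and the inputs below are hypothetical rather than fitted to any season.

```python
import math

def beta_pdf(x, a, b):
    """Beta(a, b) density at x, computed on the log scale (assumes a, b > 1)."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1.0) * math.log(x) + (b - 1.0) * math.log1p(-x))

def expected_max(a, b, n, grid=4000):
    """E[max of n i.i.d. Beta(a, b) draws] = integral over [0, 1] of 1 - F(x)**n."""
    h = 1.0 / grid
    pdf = [beta_pdf(i * h, a, b) for i in range(grid + 1)]
    # Cumulative trapezoid rule gives F on the grid; rescale so F(1) is exactly 1.
    cdf = [0.0]
    for i in range(1, grid + 1):
        cdf.append(cdf[-1] + 0.5 * h * (pdf[i - 1] + pdf[i]))
    cdf = [c / cdf[-1] for c in cdf]
    # Trapezoid rule again for the outer integral.
    integrand = [1.0 - c ** n for c in cdf]
    return h * (sum(integrand) - 0.5 * (integrand[0] + integrand[-1]))

# Hypothetical inputs: roughly a .270 mean with a .027 standard deviation.
alpha, beta, n_qualified = 72.7, 196.6, 150
print(f"E(MaxAVG) = {expected_max(alpha, beta, n_qualified):.3f}")
```

With n = 1 the function simply returns the mean of the distribution, which makes for a quick sanity check.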

Emax_medium

The Best Batting Average Seasons

Finally, as we have a distribution that allows us to calculate the expected value and variance of the leading average, it is possible to look for the best league-leading seasons in baseball history. There are many ways to go about this, but the easiest method is calculating the probability that the maximum batting average in that season's environment would be greater than the one actually observed. The smaller that probability, the more impressive the season. With that in mind, the five most (and five least) impressive leading hitting seasons are...

Season Player AVG E(MaxAVG) P(MaxAVG>Actual)
1977 Rod Carew 0.388 0.349 0.0062
1980 George Brett 0.390 0.350 0.0065
1941 Ted Williams 0.406 0.361 0.0123
1957 Ted Williams 0.388 0.351 0.0273
1924 Rogers Hornsby 0.423 0.385 0.0313
1963 Tommy Davis 0.326 0.335 0.8598
2012 Buster Posey 0.336 0.347 0.8716
1903 Honus Wagner 0.355 0.369 0.8741
1990 George Brett 0.330 0.341 0.9011
1938 Jimmie Foxx 0.349 0.362 0.9157

So yeah, it seems the .400 hitter is gone. Like, real gone. It's not technically impossible, but if the future probability that the leading hitter reaches .400 is the same as the average yearly probability from 1942-2014 (0.0033), our grandchildren's grandchildren's grandchildren's grandchildren could be expected to come and go without seeing one.
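To put a number on that last claim, here is a back-of-the-envelope sketch. If every season is an independent trial with the same chance p of producing a .400 hitter, the wait for the next one is geometric with mean 1/p, which is roughly 300 seasons at p = 0.0033. The chance of seeing at least one in a given stretch of years is 1 - (1 - p)^years. The 25-year generation length is my own assumption.

```python
import math

def prob_at_least_one(p, years):
    """P(at least one success in `years` independent seasons with yearly chance p)."""
    return -math.expm1(years * math.log1p(-p))

def expected_wait(p):
    """Mean of a geometric waiting time: 1 / p seasons."""
    return 1.0 / p

p = 0.0033             # average yearly probability for 1942-2014, as quoted above
generation_years = 25  # assumed generation length, not from the article
horizon = 8 * generation_years  # four rounds of grandchildren = eight generations

print(f"Expected wait: {expected_wait(p):.0f} seasons")
print(f"P(at least one .400 season in {horizon} years) = {prob_at_least_one(p, horizon):.3f}")
```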

. . .

Data courtesy of FanGraphs.

Stephen Loftus is an editor at Beyond The Box Score. You can follow him on Twitter at @stephen__loftus.