There is little doubt that my article on Monday about how likely it is for a batter to top a .400 AVG inspired some rather healthy debate, particularly at Baseball Think Factory. While some of the comments need to be taken with a pinch of salt, there were a couple of very important points that are worth reiterating and bearing in mind as we think about the statistical analysis of baseball.
The first point of contention is whether batting averages are normally distributed. ACE1242 wrote:
Alas, Mr. Beamer's numerical conclusions derive from an erroneous premise. Player AVG's aren't normally distributed; they're quite sharply skewed. Below a certain AVG, Darwin kicks in and eliminates that player from the MLB population. In such a sample space, the concept of "standard deviation" isn't particularly useful.
So is this true? Ace is correct that Darwin does kick in, but does this mean that AVG isn't normally distributed? It isn't too difficult to work out. If AVG were normally distributed, a histogram of it would show the classic bell curve shape, and that is indeed what we see (cut-off here is 200 at-bats).
Although Darwin weeds out poor hitters, this largely happens more than three standard deviations below the mean, and about 99.7% of a normal population lies within three standard deviations of the mean. A more statistical way to check whether batting average is normally distributed is to run a Kolmogorov-Smirnov test. A K-S test compares the observed distribution with the one expected under normality; in other words, it asks whether the data we observe fit our hypothesis that they are normally distributed. Running this gives no reason to reject that hypothesis. Technically each hitter's hit total is binomially distributed, as there are two outcomes (hit or no hit), but over a large number of trials the binomial is well approximated by the normal (especially with the 200 at-bat cut-off used in the preceding chart).
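For readers who want to try the K-S check themselves, here is a minimal sketch using only the Python standard library. It does not use the real data behind the chart: the averages are simulated, and the talent spread (mean .265, SD .020), the 300-player sample, and the helper names `ks_statistic` and `simulate_averages` are all assumptions of mine for illustration.

```python
import random
from statistics import NormalDist, mean, stdev

def ks_statistic(sample, dist):
    """Largest gap between the sample's empirical CDF and dist's CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        cdf = dist.cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d

def simulate_averages(n_players, rng):
    """Synthetic seasons (not real data): assumed talent spread, 200-600 AB."""
    averages = []
    for _ in range(n_players):
        talent = rng.gauss(0.265, 0.020)   # assumed true-talent distribution
        at_bats = rng.randint(200, 600)    # mirrors the 200 at-bat cut-off
        hits = sum(rng.random() < talent for _ in range(at_bats))
        averages.append(hits / at_bats)
    return averages

rng = random.Random(2005)
avgs = simulate_averages(300, rng)

# Compare the simulated averages with a normal fitted to their mean and SD.
fitted = NormalDist(mean(avgs), stdev(avgs))
d = ks_statistic(avgs, fitted)

# Rough 5% critical value for a normal whose parameters are fixed in advance.
critical = 1.36 / len(avgs) ** 0.5
print(f"D = {d:.4f}, approximate 5% critical value = {critical:.4f}")
```

A D below the critical value means the test gives no grounds to reject normality; it does not prove the data are normal. Note also that the 1.36 / sqrt(n) cut-off strictly applies when the mean and SD are specified in advance. Fitting them from the same data, as done here, makes the test more lenient than it should be, which is what the Lilliefors variant of the test corrects for.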
The second important discussion is whether AVG really follows a binomial distribution. For an event to be binomial a couple of conditions have to be met. First, the probability of success has to be the same for each trial. For instance, if a batter's chance of a hit is .300 in one at-bat but .350 in the next, then technically the binomial distribution breaks down. Second, successive trials must be independent. That is, if a batter is facing a particular pitcher, the outcome of his first at-bat doesn't influence the outcome of his second. TwoAlous took issue with both assumptions:
Whatever makes you think that the outcomes of at-bats are independent and identically distributed? In fact, I think the claim is quite false. Firstly, it's easier to get a hit off Tyler Walker than Pedro Martinez. Secondly, players have patterns across the season - for example, there are players who are regularly slow starters, and players who regularly fade towards the end of the season. Thirdly, a player hitting .320 may be more confident than one hitting .280. Fourthly, there is some evidence that hitters adjust during the course of a game and so hit better towards the end than at the beginning.
If you treat the separate ABs as independent events, I will think you're making a shaky assumption, but probably a necessary one to get some kind of working model so it's not the biggest deal. But by treating the events as identically distributed, you're making a huge error.
Perhaps I am confusing you. To give a concrete example of this, consider the weather. Suppose that in Boston there is a 50% chance it will rain on any given day. In Calcutta, during the rainy season there is a 90% chance it will rain on any given day, and during the dry season there is a 10% chance it will rain on any given day. Each season lasts six months. Call Rb the number of rainy days in one year in Boston, and Rc the number of rainy days in one year in Calcutta.
E(Rb) = E(Rc), but the distributions of these two variables are very different. For instance, it should be intuitively clear that Var(Rb) > Var(Rc).
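The Boston and Calcutta comparison is easy to check numerically. The sketch below treats each day as an independent Bernoulli trial, so the variance of the yearly count is just the sum of p*(1-p) over the days. The 364-day year (two seasons of 182 days) is my simplifying assumption so that the two expected values match exactly, and `rainy_day_stats` is a made-up helper name.

```python
def rainy_day_stats(daily_probs):
    """Mean and variance of the rainy-day count, treating days as independent."""
    expected = sum(daily_probs)
    variance = sum(p * (1 - p) for p in daily_probs)
    return expected, variance

DAYS_PER_SEASON = 182  # assumption: two equal seasons, i.e. a 364-day year

boston = [0.5] * (2 * DAYS_PER_SEASON)
calcutta = [0.9] * DAYS_PER_SEASON + [0.1] * DAYS_PER_SEASON

mean_b, var_b = rainy_day_stats(boston)
mean_c, var_c = rainy_day_stats(calcutta)

print(f"Boston:   mean {mean_b:.1f} days, variance {var_b:.2f}")
print(f"Calcutta: mean {mean_c:.1f} days, variance {var_c:.2f}")
```

Both cities expect 182 rainy days, but the variance is 91 for Boston against 32.76 for Calcutta. In other words, when p swings on a fixed schedule the total becomes less variable than the constant-p binomial with the same mean, not more.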
Now this is certainly an interesting argument but does it really apply to baseball? Another contributor, Walt, posted a lengthy response:
Two Alous is technically correct but the differences between the assumptions of independence and constant p and reality are likely to be trivial.
First, some background on the binomial. The binomial distribution is the sum of a set of independent "Bernoulli trials" with a constant probability. A Bernoulli trial is essentially a coin flip (though the probability doesn't need to be .5). The expected value of the trial is p and the variance of the trial is p*(1-p). If each Bernoulli trial is independent and the p is constant, then we can sum them to form a binomial variable. If N is the number of trials/flips then the mean of the binomial is N*p and the variance is N*p*(1-p). The standard deviation is then the square root of the variance. It's not intuitively obvious but each Bernoulli trial is itself a random variable and the variance of the binomial variable is just an application of some simple covariance algebra. If a and b are two random variables AND THEY'RE INDEPENDENT then:
VAR(a + b) = VAR(a) + VAR(b)
In the case of a binomial variable, you may be adding dozens of independent random variables (each Bernoulli trial or each AB in our example) so the above formula is a bit cumbersome and N*p*(1-p) saves a lot of time.
If the trials aren't independent then we have to include the covariance of each pair of trials. But it seems to me that it would be quite difficult to build an argument that these covariances would be substantial -- it's not like getting a hit in one AB turns you into a 500 hitter in the next AB. Seems like it couldn't possibly add more than a few points to your BA in the next AB. And in the real world, such covariance is likely to be positive meaning that the actual variance is larger than the estimate which should increase slightly the probability of a 400 hitter. I can't rule out lack of independence as a major problem, but it really seems unlikely to me.
Note, the "batters do better in their 3rd or 4th AB against a pitcher that game" is not (necessarily) an example of lack of independence. Lack of independence occurs if the OUTCOME of the 2nd AB impacts the p of the outcome in the 3rd AB. The "3rd/4th AB argument" is really similar to the "BA varies by pitcher quality" -- i.e. the p of a hit is larger in the 3rd AB than the 2nd just like it's higher against Lima than Pedro. To argue lack of independence you'd have to show that getting a hit in the 2nd AB increases (or decreases) the p of getting a hit in the 3rd AB. The "confidence" hypothesis is closer to a lack of independence argument.
If we can assume independence, then the variance of the binomial variable will still equal the sum of the variances of the individual Bernoulli trials EVEN IF THE P VARIES FROM TRIAL TO TRIAL. If the p varies, then the variance will also differ from trial to trial ... but it won't vary much. If a "true 300 hitter" varies between a p of .250 against good pitchers to .350 against bad pitchers, the variances will range between .1875 and .2275 (under constant p, the variance of each trial would be .21).
Over 600 AB, assuming constant p, the variance would be 126 and the sd 11.2. Over 600 AB, assuming that range of non-constant p, the variance probably comes out to 120 (or probably a smidgen higher) and an sd of 11. That difference is much less than 1 point of BA. Even if the true p varies from .150 to .450, I think you're still talking about no more than maybe 2 points of BA.
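Walt's back-of-envelope figures can be checked directly. The sketch below makes two assumptions that are mine rather than his: that the non-constant p is split evenly between the low and high values, and that any dependence between at-bats can be modelled as a stationary two-state Markov chain in which a hit raises the chance of a hit in the next at-bat by rho relative to an out (so the correlation between at-bats k apart is rho ** k). The function names are illustrative.

```python
from math import sqrt

def independent_variance(probs):
    """Variance of total hits when ABs are independent but p may differ."""
    return sum(p * (1 - p) for p in probs)

def markov_variance(n, p, rho):
    """Variance of total hits for a stationary two-state Markov chain.

    p is the long-run hit probability and rho = P(hit | hit) - P(hit | out),
    so the covariance between ABs k apart is p * (1 - p) * rho ** k.
    """
    pair_terms = sum((n - k) * rho ** k for k in range(1, n))
    return p * (1 - p) * (n + 2 * pair_terms)

N = 600
constant = independent_variance([0.300] * N)
# Assumption: half the ABs against good pitchers, half against bad ones.
narrow = independent_variance([0.250] * (N // 2) + [0.350] * (N // 2))
wide = independent_variance([0.150] * (N // 2) + [0.450] * (N // 2))
# Assumption: a hit lifts the next AB by 5 points of BA relative to an out.
streaky = markov_variance(N, 0.300, 0.005)

for label, var in [("constant .300", constant),
                   ("split .250/.350", narrow),
                   ("split .150/.450", wide),
                   ("Markov, rho=.005", streaky)]:
    print(f"{label:>18}: variance {var:6.1f}, sd {sqrt(var):5.2f} hits")
```

On an even split the .250/.350 case gives a variance of 124.5 (an sd of about 11.16 hits) rather than 120, so the gap to the constant-p figure of 126 is even smaller than the comment allows, and the .150/.450 case gives 112.5. The small positive dependence pushes the variance the other way, to a little over 127. None of these moves the sd by more than about two-thirds of a hit over 600 at-bats.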
OK, that is the independence argument, but what about the assertion that successive trials aren't identically distributed? Again TwoAlous wrote:
... this rests on the flawed assumption that a hitter will face equal numbers of good and bad pitchers, which is not necessarily true. For example, Pedro Martinez is likely to face Miguel Cabrera more times per season than Moises Alou. He will never face David Wright. In addition, one season Miguel Cabrera may never face Pedro Martinez. Another season he may face him a lot. This random chance as to what pitchers a batter faces needs to be taken into account, and it will significantly raise the variance.
TwoAlous then goes on to propose a slightly more sophisticated approach for answering the original question of how likely it is for a batter to hit .400 in a given season:
But if I were going to try to calculate the probability of hitting .400, I'd proceed as follows:
1. Obtain (or estimate) the distribution of "true" BAA for MLB pitchers.
2. Calculate the expected BA for a hitter of each "true talent level" against pitchers with BAA of x.
So, for example, if a hitter with "true talent level" .350 faces a pitcher against whom the league hits .300, what is the probability he gets a hit?
3. Obtain (or estimate) the distribution of pitching talent faced by a hitter.
4. Use above information to calculate probability of hitting .400 for each "true talent level."
That's quite a bit of work, though.
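To make those four steps concrete, here is a rough sketch of how the calculation might go. Almost everything in it is an assumption for illustration and none of it comes from the thread: the three pitcher tiers and their shares, the .265 league average, the 150 games of four at-bats each with one pitcher per game, and above all the log5-style odds-ratio formula used for step 2, which is simply one common way to combine a batter's and a pitcher's averages.

```python
from math import comb

def log5(batter, pitcher, league):
    """Odds-ratio (log5-style) estimate of P(hit) for one match-up."""
    odds = ((batter / (1 - batter)) * (pitcher / (1 - pitcher))
            / (league / (1 - league)))
    return odds / (1 + odds)

def binomial_pmf(n, p):
    """Probabilities of 0..n successes in n independent trials."""
    return [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]

def convolve(a, b):
    """Distribution of the sum of two independent counts."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

# Steps 1 and 3 (hypothetical numbers): pitcher tiers as (BAA, share of games).
LEAGUE_AVG = 0.265
PITCHERS = [(0.240, 0.25), (0.265, 0.50), (0.290, 0.25)]
TALENT = 0.350                 # the hitter's "true talent level"
GAMES, AB_PER_GAME = 150, 4    # assumption: 600 AB, one pitcher per game

# Step 2: expected average against each tier.
matchups = [(log5(TALENT, baa, LEAGUE_AVG), share) for baa, share in PITCHERS]

# Hits in a single game, mixing over which tier of pitcher turns up.
game = [0.0] * (AB_PER_GAME + 1)
for p, share in matchups:
    for k, prob in enumerate(binomial_pmf(AB_PER_GAME, p)):
        game[k] += share * prob

# Step 4: the season is the sum of independent games.
season = [1.0]
for _ in range(GAMES):
    season = convolve(season, game)

target = 240  # hits needed for .400 over 600 AB
p_matchup = sum(season[target:])

# Baseline: plain binomial with the same average hit probability.
mean_p = sum(p * share for p, share in matchups)
p_simple = sum(binomial_pmf(GAMES * AB_PER_GAME, mean_p)[target:])

print(f"P(.400) with match-ups:   {p_matchup:.5f}")
print(f"P(.400) simple binomial:  {p_simple:.5f}")
```

Because the pitcher is drawn afresh each game, the mix of pitching a hitter sees varies from season to season, which is the extra variance TwoAlous describes. With these made-up inputs that extra variance is, by my arithmetic, well under 1% of the total, so the two probabilities should land close together. A wider spread of pitcher quality or a more lopsided schedule would widen the gap.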
It certainly is a bit of work! I personally think, and it is worth reiterating that this is my own opinion, that TwoAlous's approach would result in an almost identical answer. Here is what I wrote on BTF in response:
TwoAlous - I think your proposed methodology suffers from some of the same shortcomings that have been discussed above. Essentially the above thread calls into question the attempt to quantify what the batter's true skill level is. As you point out, how can you do this when batters face different pitchers, or play in different parks, or have greater odds of scoring later in innings? The answer is that unless you take all these variables into account you can't actually determine what the true skill level of a batter is.
The same also has to be true for pitchers. All the same points that we make above for batters hold. For example Chris Carpenter never has to face Albert Pujols, but has the pleasure of pitching to Brad Ausmus. There are also other extraneous factors like those mentioned above. This is notwithstanding the fact that if we follow the argument to its end then one can say that even though pitcher A is a .300 pitcher, against batter B he becomes a .400 pitcher for whatever reason (some batters just seem to have the number of certain pitchers, right?).
Batting and pitching are intimately linked. That is what baseball is about: match-ups. Also another thing that needs to be considered is that BABIP isn't a hugely repeatable skill (admittedly this will be accounted for by the distribution around the expected pitcher skill level), but is still a factor which adds noise and uncertainty to what we are trying to do.
I agree that your proposed approach is probably more technically accurate, but my personal view is that the results wouldn't alter much (I guess from your post that you probably don't agree - which will make doing the analysis an interesting exercise for both our sakes). Ultimately, over the course of 162 games, 600 at-bats, I genuinely believe that effects like Pedro not facing David Wright etc. are not important. Pedro has to face Chipper Jones, but John Smoltz doesn't. To move from a .350 average to a .400 average requires an extra 30 hits, which is not an insubstantial number. And I can't see any other factors which will affect the skill outcome by more than the odd hit here or there over the course of an entire season. There is, of course, random variation that needs to be accounted for on top of that.
Ultimately there are an almost infinite number of variables that can be taken into account. We need to use our judgement to choose the most relevant approach.
There are two things that I have taken away from the thread at BTF. First, as baseball analysts it is critical that we be careful about qualifying, testing and stating our assumptions. This is something which I was perhaps guilty of not doing in the original article. This left some of the conclusions open to debate, which a careful analysis of the assumptions would have avoided. Second, we need to understand how far to push our analysis. For instance, in the analysis of clutch hitting, or line drives, or DIPS, where there isn't much repeatable skill, an additional level of data precision and granularity will be extremely valuable. For some debates (like this one) there is little point in pushing the analysis to a new level which is unlikely to yield greater understanding. There are two views of the world: an engineer's view and a particle physicist's view. The engineer is concerned with the functionality / cost trade-off while the physicist wants precision and accuracy at any cost. The approach we need to take depends on the question we are trying to answer, and how constrained we are by data and time. To work out the likelihood of a batter hitting .400 the engineer's approach is superior.