In a piece a couple of weeks ago, I motivated the creation of a statistic PresAvg through the example of Dan Uggla's hitting streak. Ideally, it was envisioned as a better predictor of the probability of getting a hit.
I had originally planned for that to be the only appearance of PresAvg in an article, but it generated some interest, along with questions about its predictive ability. As I had not had time to really test it (the article idea was conceived, executed, and written in about a day), I couldn't answer those questions at the time. Well, I've since had time. The rest of this piece covers the sample used, the test run, and what the results say about how predictive PresAvg actually is.
Because it is unknown how well PresAvg handles injuries or other circumstances that cause a long layoff between games, I restricted the sample to the 50 players with the most plate appearances in 2012. From there, I obtained each player's game logs and calculated their RecAvg and OvAvg.
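For anyone who wants to reproduce that step, here is a minimal sketch of the calculation. It rests on assumptions that are not spelled out in this piece: a game log is a chronological list of (hits, at_bats) pairs, OvAvg is the season-to-date average entering each game, and RecAvg is the average over the previous seven games (the window is borrowed from the 7-game lag mentioned later). The function name `running_averages` is hypothetical.

```python
def running_averages(game_log, window=7):
    """Return a list of (rec_avg, ov_avg) pairs, one per game.

    game_log: chronological list of (hits, at_bats) tuples.
    Both averages use only games played BEFORE the game in question,
    so they are usable as predictors. An entry is None when there are
    no prior at-bats to average over.
    The 7-game window is an assumption, not the article's definition.
    """
    rows = []
    for i in range(len(game_log)):
        prior = game_log[:i]          # every game strictly before game i
        recent = prior[-window:]      # at most the last `window` of those
        ov_h = sum(h for h, _ in prior)
        ov_ab = sum(ab for _, ab in prior)
        rec_h = sum(h for h, _ in recent)
        rec_ab = sum(ab for _, ab in recent)
        ov = ov_h / ov_ab if ov_ab else None
        rec = rec_h / rec_ab if rec_ab else None
        rows.append((rec, ov))
    return rows
```

Using only prior games matters here: if the game being predicted leaked into either average, the later fit would look better than it really is.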
Generalized Linear Models
So, to test the predictive ability, I decided to regress the player's performance in the next game on their PresAvg prior to the game. In this case, a linear regression is not the correct technique to use. As the response (hits and at-bats in the next game) is a count rather than a continuous quantity, the Normal assumption of linear regression is not appropriate. Even if you took a player's batting average in the game, a simple linear regression would not be appropriate: a single-game average is bounded between 0 and 1 and can take only a handful of values.
A more appropriate technique is a generalized linear model. The GLM assumes that the response yi comes from some distribution f(yi) with mean E(yi)=μi. Then it assumes that there is a linear relationship between some function of μi and your data. That is,
g(μi) = β0 + β1·xi
where g( ) is a user-specified link function, xi is the predictor (here, PresAvg entering the game), and β0 and β1 are the coefficients to be estimated. So, in this case, we regress the results of the batter in the next game on PresAvg. To evaluate how effective the model is, the model deviance is used. It is somewhat similar to an R² value in a linear regression, in that it relates to how much of the response is explained by the predictor. However, deviance measures lack of fit, so a smaller deviance value is better.
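To make the fitting step concrete, here is a minimal sketch of such a model. The piece does not say which distribution or link was used, so this assumes a binomial response (hits out of at-bats) with a logit link, fitted by Newton-Raphson. The function names are hypothetical, and a statistics package would normally do this work.

```python
import math

def fit_binomial_glm(x, hits, at_bats, iters=25):
    """Fit logit(p_i) = b0 + b1 * x_i by Newton-Raphson; return (b0, b1).

    Assumes a binomial response with a logit link (an assumption,
    not something the article states).
    """
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, h, n in zip(x, hits, at_bats):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            r = h - n * p              # observed minus expected hits
            w = n * p * (1.0 - p)      # binomial variance weight
            g0 += r
            g1 += r * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system (information matrix) * step = score.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

def binomial_deviance(p, hits, at_bats):
    """Binomial deviance of fitted probabilities p (smaller is better)."""
    dev = 0.0
    for pi, h, n in zip(p, hits, at_bats):
        if h > 0:                      # 0 * log(0) is treated as 0
            dev += h * math.log(h / (n * pi))
        if n - h > 0:
            dev += (n - h) * math.log((n - h) / (n * (1.0 - pi)))
    return 2.0 * dev
```

Fitting one such model per candidate predictor and comparing the resulting deviances is the kind of comparison the tables in this piece report.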
Also, allow me to note that I will not be looking at p-values to determine the significance of the predictors. This is because the sample size is large (over 7,500 observations), and with that much data even a practically negligible effect produces a tiny p-value. In fact, for any true effect that is not exactly zero, the p-value will go to 0 as the sample size goes to ∞.
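A quick way to see this is to hold a small effect fixed and let the sample grow. The sketch below is not the GLM coefficient test itself. It swaps in a simple two-sided z-test on a mean, and the effect size (0.01) and standard deviation (0.45) are illustrative assumptions rather than values from the data.

```python
import math

def two_sided_p(effect, sd, n):
    """Two-sided z-test p-value for a fixed mean difference at sample size n."""
    z = effect * math.sqrt(n) / sd
    # erfc(|z| / sqrt(2)) is the two-sided tail area of a standard Normal.
    return math.erfc(abs(z) / math.sqrt(2.0))

# The same tiny effect looks more "significant" purely because n grows.
for n in (100, 7500, 1_000_000):
    print(n, two_sided_p(0.01, 0.45, n))
```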
So, let's look at how predictive PresAvg actually is. I created 11 PresAvg statistics with different weights for RecAvg and OvAvg. These weights ranged from 0 to 1, so a PresAvg statistic created by PresAvg=0*RecAvg+1*OvAvg is essentially just predicting by the batting average, and PresAvg=1*RecAvg+0*OvAvg is predicting only by the batter's recent state.
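The family of candidate statistics can be sketched as follows. This assumes the 11 weightings are evenly spaced in steps of 0.1 and that the two weights always sum to 1, which is consistent with the endpoints described above but is not stated outright. `pres_avg_grid` is a hypothetical name.

```python
def pres_avg_grid(rec_avg, ov_avg, steps=10):
    """Return {(w_rec, w_ov): blended average} for w_rec = 0, 0.1, ..., 1.

    Assumes evenly spaced weights that sum to 1 (an assumption).
    """
    grid = {}
    for k in range(steps + 1):
        w_rec = k / steps
        w_ov = (steps - k) / steps     # computed directly to avoid 1 - w_rec rounding
        grid[(w_rec, w_ov)] = w_rec * rec_avg + w_ov * ov_avg
    return grid

# Example: a hitter batting .400 lately against a .250 season average.
blends = pres_avg_grid(0.400, 0.250)
```

Each of the resulting blends would then serve, in turn, as the single predictor in the model.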
|RecAvg Weight||OvAvg Weight||Model Deviance|
So, the results were not quite what I'd hoped. For a typical at-bat, plain batting average gives a slightly better fit. I say slightly because on the deviance scale with 7,500 observations, a difference of about 15 is really quite small.
Also important to note is the first row of the table, where the weights are both 0. This is known as the Null Deviance, and it's a baseline measure of deviance. Basically, in this application, this is just predicting the next game based on the league's batting average with no player-specific information whatsoever. So really, no metric is predicting all that well.
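For reference, the baseline can be written out explicitly. With both weights at 0 the predictor is constant, so the model has only an intercept and its fitted probability is the pooled hit rate. Assuming a binomial model (an assumption, as above), with h_i hits in n_i at-bats for game i and the convention that 0 log 0 = 0:

```latex
\bar{p} = \frac{\sum_i h_i}{\sum_i n_i},
\qquad
D_0 = 2 \sum_i \left[ h_i \log\frac{h_i}{n_i \bar{p}}
      + (n_i - h_i) \log\frac{n_i - h_i}{n_i (1 - \bar{p})} \right]
```

The deviance of any other model has the same form with the pooled rate replaced by that model's fitted probability for game i, which is why the rows of the table can be compared directly against this one.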
Disappointed by this, I went back to the creation of PresAvg. It was motivated by Dan Uggla's hitting streak, so I decided to see how PresAvg does with prediction during a streak-type situation. So, I took the dataset and cut it down to only those instances where a player's 7-game lagged batting average had been higher (or lower) than their overall batting average for 8 consecutive days. This accounts for a batter having gone on a solid two weeks of streaking (not necessarily a hitting streak) or slumping. The deviance table is shown below.
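One way to sketch that filter is shown below. It assumes "days" means consecutive games in the log, that ties break a run, and that a game is kept once the run ending at that game has reached the required length. None of these details are given in the piece, and `streak_mask` is a hypothetical name.

```python
def streak_mask(rec_avg, ov_avg, min_run=8):
    """Flag games where rec_avg has been strictly above (hot) or strictly
    below (cold) ov_avg for at least `min_run` consecutive games,
    counting the current game."""
    mask = []
    run, side = 0, 0
    for rec, ov in zip(rec_avg, ov_avg):
        current = (rec > ov) - (rec < ov)   # +1 hot, -1 cold, 0 tie
        if current != 0 and current == side:
            run += 1                        # same side as previous game
        else:
            run = 1 if current != 0 else 0  # new run starts, or tie resets
        side = current
        mask.append(run >= min_run)
    return mask

# Example with a short threshold so the behaviour is easy to follow.
flags = streak_mask([0.30, 0.30, 0.30, 0.20, 0.20], [0.25] * 5, min_run=2)
```

Changing `min_run` is all it takes to look at shorter or longer streaks.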
|RecAvg Weight||OvAvg Weight||Model Deviance|
So here, the original PresAvg estimate from before does slightly better. The flip, in which a form close to the original PresAvg (with weights of 0.4 for RecAvg and 0.6 for OvAvg) becomes the better predictor, occurs at about the point where a player's 7-game lagged average has been higher (or lower) than their overall batting average for 3 consecutive days. As the streak gets longer, RecAvg predicts the best by a larger margin, although the original PresAvg is reasonably close behind. Despite all this, we are still not seeing a much better fit than the league overall average provides. This is disappointing, although it's not necessarily surprising.
So, in the end, what do we have here? PresAvg does not provide a great estimate of future results. Neither does batting average, RecAvg, or any combination of the two. In any random at bat, batting average is slightly better at predicting a player's results in the next game. As a player begins a hot or cold streak, PresAvg becomes a better predictor, with the margin of "doing better" becoming larger and larger as the streak gets longer. Once the streak ends, it goes back to overall batting average being slightly better.
Laplace's "question of probability" is therefore still unanswered. Frankly, it might stay unanswered, as baseball is a very noisy game. Perhaps including batter vs. pitcher history would improve things. Regardless, it is still worth it to keep on taking shots at an answer.