Introducing pLD, pFB, pGB: Accurately predicting batted-ball info


After reading up on an awesome Glenn DuPaul article last month -- "Projecting BABIP and Regression toward the Mean" -- I became interested in more accurately regressing BABIP to the mean.

I decided to derive my own predictive model not for BABIP, but for LD, FB, and GB so that I could, in the future, predict BABIP as a function of these derived batted ball stats. Keep in mind that these models are derived using all qualified batters from 2008-2012, with a population of 892.

These models work best when applied to a hitter's season totals from 2008 onward.


Let's jump into the methodology for deriving the predictive model for pGB. Because GB has a year-to-year R^2 of .64 in this sample, it should be the easiest to predict. Using the idea mentioned in Glenn's article, I first tried his model: YEAR 2 = (rGB*year1GB)+((1-rGB)*lgGB), with rGB being the year-to-year correlation of GB and lgGB the league-average GB total.
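As a rough illustration of the regression-to-the-mean formula above, here is a minimal Python sketch. The function name, the sample year-one total of 250, and the use of r = 0.8 (the square root of the .64 R^2, assuming a positive correlation) are my own assumptions for illustration; 194.38 is the league-average GB figure that appears in the final pGB formula later in this article.

```python
def regress_to_mean(year1_total: float, r: float, league_avg: float) -> float:
    """Weight the player's year-one total by r and the league average by (1 - r)."""
    return r * year1_total + (1.0 - r) * league_avg


# Assumption: r is taken as the square root of the quoted year-to-year R^2 of .64.
R_GB = 0.64 ** 0.5
LG_GB = 194.38  # league-average GB total used elsewhere in the article

# 250 is a made-up year-one GB total, used only to show the arithmetic.
projected = regress_to_mean(250.0, R_GB, LG_GB)
print(round(projected, 2))
```

With these inputs the player's own total gets 80 percent of the weight and the league average the remaining 20 percent.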

However, using this model weights the league average too much in my estimation. [Note: DuPaul's model was intended for pitcher projections.] It is reasonable to say that BABIP is more of a skill for a hitter than it is for a pitcher -- a hitter who possesses certain skills may be more likely to maintain a higher BABIP.

As a result, I had to tweak this model to more evenly weight the league average. I derived a similar model but switched the weights so that the league average was weighted by the r of GB and the individual's batted ball info was weighted by 1-r, so as to account for all the other variables behind a GB. Using that model, I was left with year one and year two statistics. The next step was to create a single formula that could predict the next season's total.

The formula for the regressed totals was GByear2 = 0.7842(GByear1)+ 42.283. So now the next step was to create a weighting system to properly weight the league average while not crippling the actual batted ball total. Using the differential of the league average GB rate and the actual season total, I was able to find the r of the differential to the season total. Using that correlation as the weight in the formula while accounting for the previous regressed totals looks like this:

YEAR 2 GB = (0.7842*(GB)+42.283)*(1-rDifferential)+(rDifferential *league averageGB rate)

Which for this data set translates to roughly:

pGB = (0.7842*(GB)+42.283)*(0.82)+(.18*194.38)
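To make the arithmetic concrete, the pGB formula can be sketched in Python as follows. The function and constant names are hypothetical; the coefficients are copied from the formula above. The sketch also collapses the two steps into a single slope and intercept, which is simple algebra on the same numbers rather than a separate model.

```python
LEAGUE_AVG_GB = 194.38  # league-average GB total from the formula above
R_DIFFERENTIAL = 0.18   # weight placed on the league average


def predict_gb(gb_year1: float) -> float:
    """Apply the regressed-total line, then blend it with the league average."""
    regressed = 0.7842 * gb_year1 + 42.283
    return regressed * (1.0 - R_DIFFERENTIAL) + R_DIFFERENTIAL * LEAGUE_AVG_GB


# The same model written as a single line: pGB = SLOPE * GB + INTERCEPT.
SLOPE = 0.7842 * (1.0 - R_DIFFERENTIAL)
INTERCEPT = 42.283 * (1.0 - R_DIFFERENTIAL) + R_DIFFERENTIAL * LEAGUE_AVG_GB

print(round(SLOPE, 4), round(INTERCEPT, 2))
print(round(predict_gb(200.0), 2))  # 200 is an illustrative year-one total
```

Written this way, the model works out to roughly pGB = 0.643*GB + 69.66.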

The percent error for this model was 7.2%, which was a vast improvement. For comparison, the predictive model used in DuPaul's piece would yield a percent error of 11.74%. The standard error was 0.9371.
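The article does not spell out how percent error is calculated. One common reading is the mean absolute percent error between predicted and actual year-two totals, sketched below under that assumption. The function name and the toy numbers are hypothetical and are not taken from the data set.

```python
from typing import Sequence


def mean_percent_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Mean absolute percent error, in percent, between predictions and actuals.

    Assumption: this is one plausible definition of "percent error";
    the article does not state which formula was used.
    """
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    total = sum(abs(p - a) / a for p, a in zip(predicted, actual))
    return 100.0 * total / len(actual)


# Toy values only: two hitters who each actually hit 200 ground balls.
toy_error = mean_percent_error([190.0, 210.0], [200.0, 200.0])
print(toy_error)
```

Each toy prediction misses by 10 ground balls out of 200, so the error comes out to 5 percent.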

The exact same methodology was used to derive the LD model. The end results looked like this:

pLD = (0.4642*(LD)+47.891)*(0.88)+(0.12* 87.46)

The percent error for this model was 10.2%. That's another large decrease from the original, which yielded a percent error of 16.89%. The standard error for this model was 0.24. Given that LD has a year-to-year R^2 of .19, predicting within about 10% of the LD total seems rather accurate.

Last but not least, there's the FB predictive model. Surprisingly, this was the most accurate for this population:

pFB = (0.5839*(FB)+69.412)*(1.03)+(-0.03*170.18)

pFB yielded a 6.9 percent error, another improvement over the 12.6 percent error of the original method.
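Since all three models share one shape, they can be gathered into a single lookup, as in the Python sketch below. The `Model` type, the `predict` function, and the sample year-one totals are hypothetical names and values chosen for illustration; the coefficients are the ones given in the pGB, pLD, and pFB formulas.

```python
from typing import NamedTuple


class Model(NamedTuple):
    slope: float          # slope of the regressed-total line
    intercept: float      # intercept of the regressed-total line
    league_weight: float  # weight on the league average (rDifferential)
    league_avg: float     # league-average total for the stat


MODELS = {
    "GB": Model(0.7842, 42.283, 0.18, 194.38),
    "LD": Model(0.4642, 47.891, 0.12, 87.46),
    "FB": Model(0.5839, 69.412, -0.03, 170.18),
}


def predict(stat: str, year1_total: float) -> float:
    """Blend the regressed year-one total with the league average for a stat."""
    m = MODELS[stat]
    regressed = m.slope * year1_total + m.intercept
    return regressed * (1.0 - m.league_weight) + m.league_weight * m.league_avg


# Illustrative year-one totals, not drawn from the data set.
for stat, total in (("GB", 200.0), ("LD", 90.0), ("FB", 170.0)):
    print(stat, round(predict(stat, total), 2))
```

Note that the FB weight on the league average is negative (-0.03), so the regressed total is multiplied by 1.03 and the league term is subtracted instead of added.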

As you can see, all three models are vast improvements over the original regression-to-the-mean method. While the predictions may not be perfect given the high variability of BABIP, it is necessary to first be able to predict the essence of BABIP before trying to tackle the daunting prospect of pinpointing future performance on balls in play.

Being that BABIP is in the control of the hitter more so than the pitcher, it is only reasonable to change the weight of the league average so that we weaken the chances for the predictive model to favor the league average. By doing so we find a more accurate, precise model. Hopefully this is the next step in taming the ever-elusive variation present in BABIP totals.

All stats courtesy of the Lahman Database and FanGraphs. You can contact Max Weinstein on Twitter @maxweinstein21.
