What factors best predict the variance of ERA-FIP among starting pitchers?

You've read it a thousand times: a pitcher is over- or underperforming his peripherals, and his ERA should move in the direction of a truer measure of performance, which we commonly accept to be represented by FIP. We know that for most pitchers these values diverge, but can we account for the variance of ERA-FIP?


If you follow sabermetric baseball analysis, you probably read about pitchers over- or underperforming their peripherals on a daily basis, typically by way of a reference to a pitcher's ERA relative to his FIP. The former gives an account of what has happened in terms of actual earned runs allowed, while we believe the latter tells us more about the role the pitcher played independent of his defense. FIP does this by taking into account the events over which the defense has the least influence -- strikeouts, walks/hit batters, and home runs -- and it also lays out a reasonable expectation of the performance to come.

Subtract FIP from ERA (E-F) and you have a measure of the pitcher's earned run results relative to the performance we might expect if his defense converted an average number of batted balls into outs. While this handy metric acts as a proxy for the divergence between individual performance and observed outcome, we're left to account for the variance between pitchers. It should be noted that neither metric is a perfect reflection of the latent concept it seeks to measure, but they are two widely used metrics worthy of study. FIP isn't perfect, but it is pretty good, and understanding how and why a pitcher's ERA runs higher or lower than his FIP is the aim of this effort.

At the end of the 2014 season, Neil Weinberg did a case study in which he provided an explanation for the largest positive (Clay Buchholz) and negative (Doug Fister) E-F values that season. I've followed his lead here, but I've also added a couple of new indicators -- some of which are proxies -- that I hypothesize will also account for some of the variance of E-F. Feel free to disagree with these variables or to suggest new ones.

  1. BABIP -- luck (though not exclusively luck)
  2. Percentage of runners left on base -- a proxy for sequencing of events
  3. Ground ball rate -- type of contact induced
  4. Average exit velocity of batted balls -- quality of contact induced
  5. Rate of stolen base attempts (SBA%) -- ability to control the running game
  6. Expected runs allowed per nine innings (RE24/9) minus park-adjusted runs allowed per nine (pRA9) -- a proxy for bullpen influence on runs allowed
The last predictor merits further explanation. I borrowed this metric from Neil's article, where he gives the rationale for including it in his study.

There's a very simple way to evaluate this [bullpen support] effect. If you leave a runner on first base every time you get pulled, but the bullpen allows them to score every time, all of those runs aren't really your fault. You should really only be charged for the expected number of runs, which we can approximate using RE24 on a per 9 inning scale. We'll also have to park adjust their RA9.
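Mechanically, the proxy is just the difference between two per-nine rates. A minimal sketch in Python, with made-up inputs purely for illustration (the real values come from FanGraphs data):

```python
# Bullpen-support proxy described above: expected runs allowed per nine
# (RE24 scaled to nine innings) minus park-adjusted runs allowed per nine.
# Input values here are hypothetical, not from the study data.
def bullpen_proxy(re24_per9, park_adjusted_ra9):
    """Difference between the two per-nine rates."""
    return re24_per9 - park_adjusted_ra9

# A starter with 4.20 expected runs per nine and a 3.80 park-adjusted RA9:
print(round(bullpen_proxy(4.20, 3.80), 2))  # -> 0.4
```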

Before I go any further I'll lay out the parameters of the study population. I used first-half stats for qualified starting pitchers, according to FanGraphs (n=95). At the All-Star break, E-F for this group ranged from 1.68 (Drew Hutchison) to -1.56 (Hector Santiago) with an unweighted mean of 0.03 (standard deviation 0.67). Interestingly, the median value was 0.00, courtesy of Phillies' starter Aaron Harang.

I pulled the data listed above from FanGraphs, with the exception of SBA% (calculated from stats available at Baseball-Reference) and average exit velocity (Baseball Savant). While a larger sample of data would be useful, we only have access to StatCast batted ball velocity for the 2015 season. As that data becomes more widely available, the sample size for this study can increase. If anything, the small sample will make this a harder test.


These variables were the predictors in a multiple linear regression model in which E-F was the response variable. The model estimates the degree to which these six factors account for the variance in E-F, as well as their relative importance -- the amount of variance for which each accounts. It is represented by the formula below, where epsilon (ε) indicates the error in the equation, or the variance in E-F that the model isn't able to explain.

E-F ~ BABIP + LOB% + GB% + SBA% + AverageExitVelo + (RE24/9 - pRA9) + ε
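For readers curious what fitting such a model involves, here is a bare-bones ordinary least squares fit in Python on a tiny synthetic data set. The actual model in this article was fit in R on the real pitcher data; the numbers below are illustrative only.

```python
# Illustrative sketch: multiple linear regression via ordinary least squares,
# solving the normal equations (X'X) b = X'y with Gaussian elimination.
def ols(X, y):
    """X rows include a leading 1 for the intercept; returns coefficient list."""
    n, k = len(X), len(X[0])
    # Build the normal equations: A = X'X, b = X'y
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)] for i in range(k)]
    b = [sum(X[r][i] * y[r] for r in range(n)) for i in range(k)]
    # Gaussian elimination with partial pivoting
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution
    beta = [0.0] * k
    for i in range(k - 1, -1, -1):
        beta[i] = (b[i] - sum(A[i][j] * beta[j] for j in range(i + 1, k))) / A[i][i]
    return beta

# Synthetic data generated from y = 1 + 2*x1 - 3*x2, so OLS recovers it exactly
rows = [(0.1, 0.5), (0.4, 0.2), (0.9, 0.7), (0.3, 0.9), (0.6, 0.1)]
X = [[1.0, x1, x2] for x1, x2 in rows]
y = [1 + 2 * x1 - 3 * x2 for x1, x2 in rows]
print([round(v, 6) for v in ols(X, y)])  # -> [1.0, 2.0, -3.0]
```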

Overall, these factors do a pretty good job accounting for the variance in E-F (adjusted R^2 = 0.78), and the full model was significant (F(6, 88) = 55.2, p < 0.0001). Great!

Each pitcher's values were his own (i.e., we didn't use any stats that are shared across pitchers, such as team defense), so the independence criterion required to use this type of analysis is satisfied. The model also satisfied the remaining assumptions needed to confidently interpret a multiple linear regression. If you're interested, I've appended those results to the end of the article.

So let's look at the results of the regression! One important note as we go through the effect of each predictor: the predicted change in E-F assumes that all other independent variables in the equation are held constant.

| Predictor | Estimate | Std. Error | t value | Pr(>\|t\|) | p < |
| --- | --- | --- | --- | --- | --- |
| (Intercept) | 6.82 | 2.33 | 2.93 | 0.00 | 0.001 |
| BABIP | 12.25 | 1.38 | 8.85 | 0.00 | 0.0001 |
| LOB% | -6.23 | 0.74 | -8.45 | 0.00 | 0.0001 |
| GB% | -0.84 | 0.49 | -1.73 | 0.09 | 0.1 |
| SBA% | 0.068 | 0.87 | 0.08 | 0.94 | 1.0 |
| AverageExitVelo | -0.062 | 0.03 | -2.37 | 0.02 | 0.05 |
| RE24/9 - pRA9 | 0.078 | 0.15 | 0.50 | 0.62 | 1.0 |

As you may have guessed, a one-unit change in BABIP produces the largest change in E-F, but the estimate in the table needs rescaling to make sense in the real world. Essentially, for every 10 points a pitcher's BABIP goes up, E-F increases by 0.12. That means that if you increase the BABIP of a (qualified) starting pitcher with a league-average E-F from .290 to .300, the model predicts that he will underperform his peripherals by 0.12 runs.

The same goes for LOB%, except that it moves E-F in the opposite direction. Change a pitcher's LOB% from league average (73 percent) to, let's say, 83 percent, and you can expect him to perform better than his peripherals by about 0.62 runs, all else equal.
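These back-of-envelope interpretations follow directly from the slope estimates in the regression table above. A quick sketch in Python (coefficients taken from that table; note that a "10 point" change is 0.010 for BABIP but 0.10 for the rate stats):

```python
# Slope estimates from the regression table above
beta = {"BABIP": 12.25, "LOB%": -6.23, "GB%": -0.84}

def delta_ef(predictor, change):
    """Predicted change in E-F for a change in one predictor, all else held constant."""
    return beta[predictor] * change

print(round(delta_ef("BABIP", 0.010), 2))  # .290 -> .300: +0.12 runs
print(round(delta_ef("LOB%", 0.10), 2))    # 73% -> 83%: -0.62 runs
print(round(delta_ef("GB%", 0.10), 2))     # +10 points of GB%: -0.08 runs
```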

The last result with a significant impact on the variance of E-F is average exit velocity (keep in mind that the Statcast data is not fully complete, so take this with a grain of salt). Since this is reported in mph, the interpretation is straightforward. With a negative slope, an increase in batted ball velocity predicts a decrease in E-F. So if a pitcher in our sample sees his average batted ball velocity against increase by one mile per hour, the model predicts that his ERA will decrease relative to his FIP by 0.06 runs.

On the surface, this may seem counterintuitive given what we know about batted ball velocity. But if you think about it from the perspective of FIP, it makes sense: home runs are the only batted balls represented in the formula, and they count for a lot.

FIP = ((13*HR)+(3*(BB+HBP))-(2*K))/IP + constant
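As a sketch, the formula translates directly into code. The league constant varies by season (roughly 3.0 to 3.2); the 3.10 used here is an assumption for illustration:

```python
# Minimal FIP calculator following the formula above.
# The constant of 3.10 is an assumed, illustrative value; the real
# constant is recomputed each season to put FIP on the league ERA scale.
def fip(hr, bb, hbp, k, ip, constant=3.10):
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant

# One inning of work, one home run allowed, nothing else:
print(round(fip(hr=1, bb=0, hbp=0, k=0, ip=1.0), 2))  # -> 16.1
```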


Consider the following hypothetical. If a pitcher were to load the bases on three hard-hit singles and then give up a grand slam, his ERA for that inning would be 36.00, while his FIP would be roughly 16 (13 times the lone home run, divided by one inning, plus the league constant). ERA charges him for all four baserunners; FIP sees only the one ball that left the yard, but weights it heavily. So, if you agree that hard-struck balls are more likely to become home runs, harder contact can inflate FIP relative to ERA. We can test this for our sample by plotting average exit velocity against HR/FB rate for the same group of pitchers. Indeed, we do see a positive relationship between the two, though it's not a particularly strong one. The strength of this relationship makes sense in light of Rob Arthur's recent work on exit velocity over at FiveThirtyEight, where he showed that exit velocity is five parts hitter, one part pitcher.

If you're willing to accept a p-value of less than 0.1, then GB% can be considered a marginally significant predictor. As you would expect, the model predicts that an increase in GB% produces a decrease in E-F -- that is, a pitcher performing better than his peripherals would suggest. Ground balls are desirable because they turn into hits less often than line drives, and they never turn into home runs the way fly balls can. The effect of GB% on E-F, however, is neither particularly significant nor substantial: the model predicts that increasing a pitcher's ground ball rate by 10 percentage points (a pretty huge jump) would shift E-F by only -0.08 runs.

The remaining predictors, SBA% and our proxy for bullpen support, account for E-F variance in the way you would expect (more stolen base attempts hurt; bullpen support helps), but neither value came close to significance.

So we've seen the effect of each of our predictors on E-F, but how much of the total observed variance does each of these factors account for? To answer that question, I ran an ANOVA on the model to partition the variance.

| Term | Mean Sq | % Variance | F value | Pr(>F) | p < |
| --- | --- | --- | --- | --- | --- |
| BABIP | 23.61 | 70.9% | 235.53 | 0.00 | 0.0001 |
| LOB% | 8.83 | 26.5% | 88.08 | 0.00 | 0.0001 |
| GB% | 0.17 | 0.5% | 1.72 | 0.19 | 1.0 |
| SBA% | 0.00 | 0.0% | 0.02 | 0.89 | 1.0 |
| AverageExitVelo | 0.56 | 1.7% | 5.60 | 0.02 | 0.05 |
| RE24/9 - pRA9 | 0.03 | 0.1% | 0.25 | 0.62 | 1.0 |
| Residuals | 0.10 | 0.3% | | | |
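Taking the table's numbers at face value, the % Variance column can be reproduced from the mean squares (each predictor carries one degree of freedom, so its mean square equals its sequential sum of squares). A quick check in Python:

```python
# Mean squares copied from the ANOVA table above
ms = {"BABIP": 23.61, "LOB%": 8.83, "GB%": 0.17, "SBA%": 0.00,
      "AverageExitVelo": 0.56, "RE24/9 - pRA9": 0.03, "Residuals": 0.10}

# Each term's share of the total, as a percentage rounded to one decimal
total = sum(ms.values())
shares = {k: round(100 * v / total, 1) for k, v in ms.items()}
print(shares["BABIP"], shares["LOB%"], shares["AverageExitVelo"])  # 70.9 26.5 1.7
```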

Once again, it's all about BABIP. This predictor alone accounts for more than two-thirds (70.9%) of the variance predicted by the model. LOB% accounts for 26.5% of the variance, making it the second most influential driver of variance in E-F. The only other factor with a significant influence on the variance predicted by this model is average exit velocity of batted balls. While significant, this predictor accounted for just 1.7% of the variance observed in our response variable.


There are certain pitchers who seem to beat or fall short of their FIP, but, typically, this happens for only a brief period before their actual and fielding-independent performance fall into step. Today, I showed that you can explain most of the difference between ERA and FIP -- for qualified starting pitchers in the first half of this season -- by looking at their BABIP and LOB%. This is not a surprising result, but it is an important one to reinforce and quantify as we go forward.

Keep in mind, however, that I have not stated anything beyond this relationship. I haven't made an argument that BABIP or LOB% variance is anything more than random chance. This is simply the first stage in a multi-step project. Next, I'll consider the role of defense and its interplay with the significant factors in this model. By continuing to unpack these predictors, we can work toward better understanding the most important reasons for variance in the difference between ERA and FIP.

Thanks to Neil Weinberg and Russell Carleton for their helpful comments (and extreme patience). All modeling was done using R. You can find a copy of the data here and the syntax I used here. Please feel free to take issue with my analysis and/or interpretations in the comments section below.

. . .

Matt Jackson is a featured writer for Beyond the Box Score and a staff writer for Royals Review. You can follow him on Twitter at @jacksontaigu.


Assumption #1: Linearity

Check! The residuals (the distance of each point from the fitted line) look pretty good. There is a small amount of curve to the line on the residuals vs. fitted values plot (top left), but this is real-world data, and I'm comfortable with the fit. The points on the QQ plot (bottom left) fall mostly in a straight line, so we're all good here.

You'll see a few points that appear to be outliers (5, 54, and 77). To check this, I used an outlier test that computes Bonferroni p-values for the most extreme observations. All good again, with no p-values below 0.05. Number 77 was the greatest outlier, but you would know him better as Nick Martinez. I also checked Cook's distance values for influential observations. Using the cutoff of 4/(n-k-1), three influential values were identified (5, 93, and 68). I'm not too worried about these guys, but I am curious about what they're doing to influence the model. You'll notice that number 5 showed up as both an outlier and an influential observation. That's Jon Niese, if you're curious.

Assumption #2: Absence of collinearity

When using multiple predictors, it's important to make sure that they're not correlated because if they move together too much, it makes it pretty tough to interpret the result, which is the whole point of building a model in the first place. Luckily, we can check this box off as well. The variance inflation factors (VIF) fall between 1.0 and 1.5. Perfectly acceptable. Moving on.

Assumption #3: Homoskedasticity

Here we'll look back at the residuals plot from assumption #1. In a homoskedastic model the residuals should look like a cloud of points rather than a particular shape, like a cone or funnel.

Assumption #4: Normality of residuals

Always with the residuals! I've plotted the studentized residuals as a frequency histogram and fit it with a curve to better visualize the distribution. As you can see, the curve has a nice, albeit real-world, bell shape. Skew and kurtosis met the constraints of acceptability according to the global assumptions test.