FIP over- and under-performance factors

Do lefties outperform their FIP more often than righties? If so, why? If not, why does anyone?

By John Choiniere May 1, 2015, 1:00pm EDT

Al Leiter's 2004 had the largest FIP-ERA in the dataset used

Chris Trotman/Getty Images

This article has been edited after posting to include effect sizes, in the form of Cohen's D, in the tables at the end.

Thank you to Chris Teeter and Stephen Loftus for help in understanding some of the statistical concepts discussed here.

In a chat at FanGraphs last week, Dave Cameron was asked why left-handed pitchers seem to outperform their FIP more frequently than RHPs do. He responded that it may have something to do with controlling the running game.

While I'm pretty certain that Dave is right - he's a far better and more knowledgeable analyst than I am, so I trust his response - I'm wondering whether the premise is correct. Do LHPs outperform their FIP more frequently than RHPs? And while I'm at it, is there a skills-based reason why that would be true?

To look into this, I pulled a whole bunch of stats from FanGraphs for all qualified pitchers from 2004-2014 (the start date was chosen arbitrarily). Beyond the obvious FIP and ERA, the dataset included stats that are highly dependent on factors external to the pitcher (e.g., K%, BABIP, others) and stats that are mostly internal to the pitcher (e.g., pitch type usage/velocity, Zone%, others), as well as stolen bases and stolen base attempts just to check Dave's idea. What I found was that there is, in fact, a small but real effect of LHPs having a better ERA than their FIP suggests; however, I wasn't able to determine precisely why that is.

Before moving on, I'll also note that the first draft of this article included 2015 data, which as it turns out was dramatically skewing the results. Since it's still so early in the season, I have no problem with excluding it from the analysis.

In my sample of 280 lefties and 674 righties, the average LHP overperformance (i.e., FIP-ERA) was 0.150 while the average RHP overperformance was 0.041 - so far, so good for the premise. The standard deviation of each sample was 0.469 and 0.517, respectively. Looking at a box plot for each, while I see a clear separation between the means, the large amount of overlap between the two makes the separation a bit murkier (note: for the sake of clarity, outliers have been removed from the following plots).

Graphically speaking, there appears to be a difference in the two populations; the mean of the RHP group is closer to zero (which we already knew), and the overall RHP distribution is a bit wider. This is clearer if we look at a histogram of the data (I'll include the box plot as well, rotated to be horizontal, to aid in comparison). Here it is for lefties:

Similarly, here's the righty histogram and box plot:

And finally, again for the sake of comparison, here's the overlaid LHP/RHP density plot comparison:

A couple of things jump out at me from those graphs. First, I can't easily (read: visually) determine whether or not there's a difference between the two groups. I see that the average is different, but there's still a ton of overlap, and there's the issue of two separate sample sizes to deal with. To get a better answer, I'll turn to some more formalized statistical tests in the subsequent paragraphs. Second, in contrast to the peak-with-shoulder shape of the LHP graph, it almost looks like there are two distinct (and overlapping) peaks in the RHP density plot. This is also worthy of further investigation, but I'll leave that to a different article and/or researcher.

The first test I'll use to examine the LHP/RHP difference is the t-test. As I understand it (having never taken a stats class ever, so maybe this is all wrong), a t-test checks whether the difference between two sample averages is larger than chance alone would plausibly produce. The p-value associated with the result tells you the probability of seeing a difference at least as large as the observed one if the true averages were the same. In this case, I use Welch's t-test (as opposed to the more familiar Student's t-test), because it does not assume the two groups have equal variance, and our LHP and RHP samples differ in both size and variance (though the variances are pretty close). The result I get is a p-value of 0.00164 - that is, there's only a 0.16% chance of seeing a difference this large if the true averages were the same! This indicates that the difference between LHPs and RHPs is real.
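As a rough check on that number, Welch's statistic can be rebuilt from nothing more than the means, standard deviations, and sample sizes quoted above. This is a minimal Python sketch, not the R code used for the article; the function name `welch_t` is mine, and the p-value uses a normal approximation (reasonable only because the degrees of freedom are large), so it will not match 0.00164 exactly.

```python
import math
from statistics import NormalDist

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    v1 = sd1 ** 2 / n1  # squared standard error of the first sample mean
    v2 = sd2 ** 2 / n2  # squared standard error of the second sample mean
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Summary figures quoted in the article: LHP first, RHP second.
t_stat, df = welch_t(0.150, 0.469, 280, 0.041, 0.517, 674)

# Assumption: with several hundred degrees of freedom the t distribution is
# close to normal, so a two-sided normal tail stands in for the exact p-value.
p_approx = 2 * (1 - NormalDist().cdf(abs(t_stat)))

print(f"t = {t_stat:.2f}, df = {df:.0f}, approximate p = {p_approx:.4f}")
```

The rounded inputs give a t statistic a bit above 3 and an approximate p-value in the same neighborhood as the reported one, which is about all a back-of-the-envelope check can promise.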

However, the t-test isn't perfect. In this case, it may not be appropriate because it assumes a normal (i.e., Gaussian/"bell-shaped") distribution of the data. I've read that the particular version of the t-test I used is fairly robust even with non-normal data, but to be thorough I also looked at the Mann-Whitney U test (also known as the Wilcoxon rank-sum test), which performs better than the t-test on non-normal distributions. For this test, I found a U value of 106348.5, a p-value of 0.00198, and a ρ-value (which is U divided by the product of the two sample sizes, where 0.5 indicates perfect sample overlap and 1 indicates no overlap) of 0.564. This indicates a very small, but still statistically significant, difference between LHPs and RHPs.
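To make the U and ρ figures concrete, here is a small Python sketch of how U can be computed from rank sums, followed by the ρ arithmetic using the numbers reported above. The function `mann_whitney_u` and the toy samples are illustrative assumptions, not the article's R code or data.

```python
def mann_whitney_u(x, y):
    """U for sample x: the count of (x_i, y_j) pairs with x_i > y_j, ties counting 0.5."""
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        # Find the run of tied values starting at position i.
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        i = j
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    n1 = len(x)
    return rank_sum_x - n1 * (n1 + 1) / 2

# Toy samples (hypothetical values, only to exercise the function).
toy_u = mann_whitney_u([1.0, 2.0, 3.0], [0.5, 1.5, 2.0])

# rho from the figures in the article: U divided by the product of the sample sizes.
rho = 106348.5 / (280 * 674)
print(toy_u, round(rho, 3))
```

Read this way, ρ is the share of all possible lefty/righty pairings in which the lefty has the larger FIP-ERA, with ties counted as half.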

Lastly, I used the Kolmogorov-Smirnov (KS) test, which I believe is the most general-use test of the three included here. As I understand it, KS tests for whether two entire distributions, not just their means or medians, are likely to be different. Using the two-sample version of it, the resulting p-value is 0.00182 - that is, there's only a 0.182% chance of seeing a gap this large between the two distributions if both samples came from the same underlying distribution.
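The quantity the KS test is built on is the largest vertical gap between the two samples' empirical cumulative distributions. The Python sketch below computes only that gap; the name `ks_statistic` and the toy inputs are assumptions for illustration, and turning the gap into a p-value is left to a statistics package (in R, `ks.test` does both).

```python
from bisect import bisect_right

def ks_statistic(x, y):
    """Two-sample KS statistic: the maximum gap between the two empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    n1, n2 = len(xs), len(ys)
    d = 0.0
    # The gap can only change at observed values, so checking those is enough.
    for v in xs + ys:
        cdf_x = bisect_right(xs, v) / n1  # share of x at or below v
        cdf_y = bisect_right(ys, v) / n2  # share of y at or below v
        d = max(d, abs(cdf_x - cdf_y))
    return d

# Toy samples (hypothetical values): half of each sample overlaps the other.
toy_d = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
print(toy_d)
```

A value of 0 means the two empirical distributions coincide, and a value of 1 means the samples do not overlap at all.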

These three results indicate that the difference between left-handers and right-handers, in terms of outperforming their FIP, is a real one - that lefties do, in fact, outperform their FIP more often than righties. However, that's not the whole story. There are two aspects to looking at a difference between two populations - determining whether the difference is real, and determining how big that difference really is. In this case, we already have one measure of the difference - the two means, 0.15 and 0.04 for LHPs and RHPs respectively - but we can do better than that with a measure called Cohen's d. Cohen's d is basically the standardized difference between the means of the two datasets (it's scaled by their pooled standard deviation). In this case, the value for d is 0.216, which is classified by Cohen as a "small" effect.
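For anyone who wants to see the arithmetic, Cohen's d can be approximated from the summary figures quoted earlier. This Python sketch assumes the usual pooled-standard-deviation form of d; the function name `cohens_d` is mine, and because the inputs are rounded the result lands near, rather than exactly on, the 0.216 computed from the full data.

```python
import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Difference in means divided by the pooled standard deviation."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)

# Summary figures quoted in the article: LHP first, RHP second.
d = cohens_d(0.150, 0.469, 280, 0.041, 0.517, 674)
print(f"Cohen's d is roughly {d:.2f}")
```

In plain terms, the average lefty's FIP-ERA sits about a fifth of a standard deviation above the average righty's.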

So it's a real, though small, effect. Why does it exist in the first place?

I used the KS test again, this time looking for significant differences in 43 statistics across FIP under- and over-performers (including total batters faced, which I included as a makeshift control to make sure there wasn't a significance to playing time). The p-values for each are listed at the end of this article, which also is how you can find the complete list of tested stats. I found 14 total stats that had a traditionally-significant (p less than 0.05) difference between the over- and under-performers: K/9, WHIP, BABIP, LOB%, GB/FB, LD%, GB%, FB%, IFFB%, K%, changeup velocity, O-swing%, O-contact%, and stolen base attempts. Further, total batters faced, two-seam%, and contact% were extremely close to making the cut. I then used the same KS test to look for differences between righties and lefties just among those 17 categories, and I found significant differences in five: LOB%, IFFB%, two-seam%, changeup velocity, and stolen base attempts.

I next looked at the correlations between pitchers' FIP-ERA and those five statistics. Of the five, only one showed any meaningful relationship: LOB% correlates with FIP-ERA at the level r^2 = 0.543, while none of the others were above 0.02. Therefore, the bottom line to all this is that I found a small but significant difference between the FIP-ERA of LHPs versus RHPs, and the only thing I found out of 43 statistics examined that could potentially explain it is LOB%, which accounts for 54.3% of the variance in FIP-ERA across pitchers (not necessarily 54.3% of the LHP/RHP gap itself).
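The r^2 values here are squared Pearson correlations, which can be read as the share of variance in one variable that a straight-line fit on the other accounts for. A minimal Python sketch of that calculation follows; the function name `r_squared` and the toy numbers are hypothetical and are not the article's data.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)

# Toy data (hypothetical): a weak relationship, then a perfectly linear one.
weak = r_squared([1, 2, 3], [1, 3, 2])
perfect = r_squared([1, 2, 3, 4], [2, 4, 6, 8])
print(weak, perfect)
```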

So, now I'm going to take a different tack and see if I can come up with a model that can predict FIP-ERA regardless of handedness. I realize linear mixed models are all the rage in baseball analysis right now, but they're a little beyond my current ability, so I stuck with old-fashioned linear regression. My attempts completely broke down whenever I included any sort of pitch type usage or velocity data, so I left those out. Just using the other stats I had at the ready, I found three highly significant factors - WHIP, LOB%, and K% - and seven others with some level of significance - the fitted intercept, K/9, BB/9, HR/9, LD%, GB%, and FB%. Using the coefficients the regression found, I created two test data points for each pitcher, one using only the highly significant factors and the other using any significant factors, and checked their correlation with the observed FIP-ERA. Although K/9 and K% are highly similar, and a more rigorous examination would likely account for that, I took this to be casual enough that I left them both in the "any significance" version.

The "high significance" model was actually worse at predicting FIP-ERA (going by correlation) than simply LOB%, with an r^2 of only 0.521. However, the "any significance" model was a large improvement, with an r^2 of 0.811. Here's a scatterplot of that dataset, adjusted linearly so that the average prediction is zero:

And here's a chart of the linear model coefficients, including p-values. Anything with a p-value under 0.1 was included in the "any significance" model, while only factors with a p-value under 0.001 were included in the "high significance" one.

Coefficient	Estimate	Std. Error	t-value	p-value
Intercept	-2.855e+01	1.357e+01	-2.103	0.03581
K_per_9	2.043e-01	9.153e-02	2.232	0.02597
BB_per_9	4.583e-01	2.658e-01	1.724	0.08519
K_per_BB	-4.551e-05	1.796e-02	-0.003	0.99798
HR_per_9	4.510e-01	1.556e-01	2.899	0.00387
WHIP	-2.605e+00	6.127e-01	-4.252	2.43e-05
BABIP	-1.618e+00	2.137e+00	-0.757	0.44933
LOB_pct	8.964e+00	2.163e-01	41.441	less than 2e-16
GB_per_FB	-4.012e-02	6.178e-02	-0.649	0.51626
LD_pct	2.619e+01	1.350e+01	1.940	0.05279
GB_pct	2.735e+01	1.348e+01	2.028	0.04291
FB_pct	2.648e+01	1.350e+01	1.961	0.05028
IFFB_pct	2.915e-02	2.893e-01	0.101	0.91979
HR_per_FB	-8.277e-01	1.336e+00	-0.619	0.53590
K_pct	-1.545e+01	3.511e+00	-4.400	1.26e-05
BB_pct	-6.442e+00	9.567e+00	-0.673	0.50097
O_Swing_pct	-1.550e+00	2.889e+00	-0.537	0.59175
Z_Swing_pct	-1.761e+00	3.129e+00	-0.563	0.57385
Swing_pct	3.768e+00	6.028e+00	0.625	0.53219
O_Contact_pct	1.788e-02	9.414e-01	0.019	0.98485
Z_Contact_pct	-4.365e-01	1.967e+00	-0.222	0.82444
Contact_pct	4.516e-01	2.886e+00	0.156	0.87572
Zone_pct	-1.485e+00	2.253e+00	-0.659	0.50992
SB_att	-2.557e-05	8.329e-04	-0.031	0.97552
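To illustrate how a "test data point" is built from those coefficients, the Python sketch below multiplies the three "high significance" estimates from the table by one pitcher's values and sums them. The pitcher's numbers are hypothetical, the helper name `raw_score` is mine, and treating LOB% and K% as fractions between 0 and 1 is an assumption about the units. The result is an unadjusted score, useful for correlation checks rather than as a FIP-ERA estimate in runs.

```python
# Estimates copied from the coefficient table for the three factors with p < 0.001.
HIGH_SIG_COEFS = {"WHIP": -2.605, "LOB_pct": 8.964, "K_pct": -15.45}

def raw_score(pitcher, coefs):
    """Sum of coefficient * stat over the chosen factors (no intercept, no rescaling)."""
    return sum(coef * pitcher[name] for name, coef in coefs.items())

# Hypothetical pitcher; LOB_pct and K_pct are assumed to be stored as fractions.
example_pitcher = {"WHIP": 1.20, "LOB_pct": 0.75, "K_pct": 0.20}

score = raw_score(example_pitcher, HIGH_SIG_COEFS)
print(round(score, 3))
```

Because correlation is unaffected by adding a constant or rescaling, scores like this can be compared against observed FIP-ERA directly, which is presumably why the scatterplot could be shifted so the average prediction is zero.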

The analysis in this article was conducted via R; code and data beyond standard R functions (like plot, etc) are available at https://github.com/johnchoiniere/fip_article.

KS Test results, LHP v. RHP:

Statistic	Cohen's D	p-value
TBF	0.0519	0.3738
K_per_BB	0.0713	0.005694
HR_per_9	0.0002	0.6881
WHIP	0.0579	0.1858
BABIP	0.0506	0.6215
LOB_pct	0.1850	0.03797
GB_per_FB	0.2095	0.2825
LD_pct	0.0103	0.9768
GB_pct	0.1825	0.2271
FB_pct	0.1857	0.3018
IFFB_pct	0.2475	0.003577
HR_per_FB	0.0949	0.5586
K_pct	0.0431	0.6723
BB_pct	0.2086	0.1085
FA_pct	0.0198	0.8626
FT_pct	0.3711	0.001394
FC_pct	0.1950	0.0008498
FS_pct	0.0505	0.5089
SI_pct	0.2401	0.002531
SL_pct	0.5397	6.614e-07
CU_pct	0.0695	0.5771
KC_pct	0.5070	0.4596
CH_pct	0.5787	3.366e-08
vFA	0.4972	2.494e-06
vFT	0.9290	2.479e-09
vFC	0.7590	9.438e-09
vFS	0.2108	0.07359
vSI	0.0877	0.05809
vSL	0.7175	7.771e-16
vCU	0.4696	4.162e-09
vKC	0.5057	0.5041
vCH	0.7294	1.208e-07
O_Swing_pct	0.0285	0.8688
Z_Swing_pct	0.2038	0.2042
Swing_pct	0.0708	0.07568
O_Contact_pct	0.0497	0.5849
Z_Contact_pct	0.1935	0.09012
Contact_pct	0.0618	0.7505
Zone_pct	0.0576	0.3024
SB_att	0.2428	0.04741

KS Test results, FIP-ERA>0 v. FIP-ERA<0

Statistic	Cohen's D	p-value
TBF	0.01429	0.06307
K_per_BB	0.1847	0.07852
HR_per_9	0.1334	0.07836
WHIP	0.5985	7.604e-12
BABIP	1.398	less than 2e-16
LOB_pct	1.455	less than 2e-16
GB_per_FB	0.125	0.009734
LD_pct	0.3503	4.365e-05
GB_pct	0.1000	0.01886
FB_pct	0.2227	0.01178
IFFB_pct	0.2394	0.001278
HR_per_FB	0.06530	0.5514
K_pct	0.1425	0.01954
BB_pct	0.07966	0.6579
FA_pct	0.02362	0.7565
FT_pct	0.1286	0.06028
FC_pct	0.1509	0.2467
FS_pct	0.1326	0.3847
SI_pct	0.1491	0.5995
SL_pct	0.05600	0.9728
CU_pct	0.01358	0.9405
KC_pct	0.4016	0.4674
CH_pct	0.1106	0.2276
vFA	0.1647	0.3736
vFT	0.1650	0.2911
vFC	0.1665	0.2766
vFS	0.3521	0.6644
vSI	0.02397	0.8079
vSL	0.1910	0.07691
vCU	0.2338	0.09526
vKC	0.7721	0.5546
vCH	0.2490	0.002974
O_Swing_pct	0.1431	0.04009
Z_Swing_pct	0.01711	0.4663
Swing_pct	0.04245	0.7286
O_Contact_pct	0.1949	0.002054
Z_Contact_pct	0.05089	0.6402
Contact_pct	0.1187	0.06335
Zone_pct	0.07210	0.3092
SB_att	0.2145	0.0009366

All statistics courtesy of FanGraphs.

John Choiniere is a researcher and (occasional) contributor at Beyond the Box Score. You can follow him on Twitter at @johnchoiniere.
