This article has been edited after posting to include effect sizes, in the form of Cohen's D, in the tables at the end.
Thank you to Chris Teeter and Stephen Loftus for help in understanding some of the statistical concepts discussed here.
In a chat at FanGraphs last week, Dave Cameron was asked why left-handed pitchers seem to outperform their FIP more frequently than right-handers do. He responded that it may have something to do with controlling the running game.
While I'm pretty certain that Dave is right - he's a far better and more knowledgeable analyst than I am, so I trust his response - I'm wondering whether the premise is correct. Do LHPs outperform their FIP more frequently than RHPs? And while I'm at it, is there a skills-based reason why that would be true?
To look into this, I pulled a whole bunch of stats from FanGraphs for all qualified pitchers from 2004-2014 (the start date was chosen arbitrarily). Beyond the obvious FIP and ERA, the dataset included stats that are highly dependent on factors external to the pitcher (e.g., K%, BABIP) and stats that are mostly internal to the pitcher (e.g., pitch type usage/velocity, Zone%), as well as stolen bases and stolen base attempts, just to check Dave's idea. What I found was that there is, in fact, a small but real effect of LHPs having a better ERA than their FIP suggests; however, I wasn't able to determine precisely why that is.
Before moving on, I'll also note that the first draft of this article included 2015 data, which as it turns out was dramatically skewing the results. Since it's still so early in the season, I have no problem with excluding it from the analysis.
In my sample of 280 lefties and 674 righties, the average LHP overperformance (i.e., FIP-ERA) was 0.150 while the average RHP overperformance was 0.041 - so far, so good for the premise. The standard deviation of each sample was 0.469 and 0.517, respectively. Looking at a box plot for each, while I see a clear separation between the means, the large amount of overlap between the two makes the separation a bit murkier (note: for the sake of clarity, outliers have been removed from the following plots).
Graphically speaking, there appears to be a difference in the two populations; the mean of the RHP group is closer to zero (which we already knew), and the overall RHP distribution is a bit wider. This is clearer if we look at a histogram of the data (I'll include the box plot as well, rotated to be horizontal, to aid in comparison). Here it is for lefties:
Similarly, here's the righty histogram and box plot. And finally, again for the sake of comparison, here's the overlaid LHP/RHP density plot comparison:
A couple of things jump out at me from those graphs. First, I can't easily (read: visually) determine whether or not there's a difference between the two groups. I see that the average is different, but there's still a ton of overlap, and there's the issue of two separate sample sizes to deal with. To get a better answer, I'll turn to some more formalized statistical tests in the subsequent paragraphs. Second, in contrast to the peak-with-shoulder shape of the LHP graph, it almost looks like there are two distinct (and overlapping) peaks in the RHP density plot. This is also worthy of further investigation, but I'll leave that to a different article and/or researcher.
The first test I'll use to examine the LHP/RHP difference is the t-test. As I understand it (having never taken a stats class, so maybe this is all wrong), a t-test evaluates the null hypothesis that two samples share the same true average. The p-value associated with the result tells you the probability of seeing data at least as extreme as yours if the true averages really were the same. In this case, we have to use Welch's t-test (as opposed to the more familiar Student's t-test), because unlike Student's version it doesn't assume the two samples are equal in size or variance, and ours are neither (though the variances are pretty close). The result I get is a p-value of 0.00164 - that is, there's only about a 0.16% chance of seeing a difference this large if the true averages were the same! This suggests that the difference between LHPs and RHPs is real.
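Conveniently, Welch's t statistic can be reproduced from nothing but the summary numbers reported above (means, standard deviations, and sample sizes). This Python sketch is my own illustration of the formula, not the R code used for the article:

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic and degrees of freedom from summary stats.

    Unlike Student's t-test, Welch's version does not assume the two
    samples have equal variances (or equal sizes)."""
    se1, se2 = sd1**2 / n1, sd2**2 / n2          # squared standard errors
    t = (mean1 - mean2) / math.sqrt(se1 + se2)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (se1 + se2)**2 / (se1**2 / (n1 - 1) + se2**2 / (n2 - 1))
    return t, df

# Summary stats from the article: LHP mean 0.150 (sd 0.469, n = 280),
# RHP mean 0.041 (sd 0.517, n = 674)
t, df = welch_t(0.150, 0.469, 280, 0.041, 0.517, 674)
print(t, df)   # t is about 3.17 on roughly 571 degrees of freedom
```

A t of about 3.17 on that many degrees of freedom corresponds to a two-sided p-value right around the 0.0016 reported above.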
However, the t-test isn't perfect. In this case, it may not be appropriate because it assumes a normal (i.e., Gaussian/"bell-shaped") distribution of the data. I've read that the particular version of the t-test I used is fairly robust even with non-normal data, but to be thorough I also looked at the Mann-Whitney U test (also known as the Wilcoxon rank-sum test), which performs better than the t-test on non-normal distributions. For this test, I found a U value of 106348.5, a p-value of 0.00198, and a ρ value (U divided by the product of the two sample sizes, where 0.5 indicates complete sample overlap and 1 indicates no overlap) of 0.564. This indicates a very small, but still statistically significant, difference between LHPs and RHPs.
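For intuition, U is just a count over all LHP/RHP pairs of how often one side "wins," and ρ divides that count by the total number of pairs. A stdlib-only toy sketch of my own (real implementations work from ranks, which is faster):

```python
def mann_whitney_u(a, b):
    """U statistic by brute force: for each pair (x from a, y from b),
    count 1 when x > y and 0.5 for a tie. Fine for small samples."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in a for y in b)

def rho(u, n1, n2):
    """Effect size: U over the number of pairs. 0.5 means the samples
    overlap completely; 1.0 means they are completely separated."""
    return u / (n1 * n2)

u = mann_whitney_u([3, 4, 5], [1, 2, 3])
print(u, rho(u, 3, 3))   # 8.5 and about 0.94 for this toy data

# The article's reported U value recovers its reported rho:
print(rho(106348.5, 280, 674))   # about 0.564
```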
Lastly, I used the Kolmogorov-Smirnov (KS) test, which I believe is the most general-use test of the three included here. As I understand it, KS tests whether two entire distributions, not just their means or medians, are likely to be different. Using the two-sample version of it, the resulting p-value is 0.00182 - that is, there's only a 0.182% chance of seeing these data if the two samples came from the same distribution.
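The KS statistic itself is simply the largest vertical gap between the two samples' empirical CDFs, which is why it is sensitive to differences in shape, not just location. Here's a minimal sketch of the idea (again my own Python, not the article's R):

```python
from bisect import bisect_right

def ks_statistic(a, b):
    """Two-sample KS statistic: the largest vertical gap between the
    two empirical CDFs, checked at every observed value."""
    a, b = sorted(a), sorted(b)
    return max(abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
               for x in a + b)

# Toy data: the two ECDFs differ by at most 0.5 (e.g. at x = 2)
print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))   # 0.5
```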
These three results indicate that the difference between left-handers and right-handers, in terms of outperforming their FIP, is a real one - that lefties do, in fact, outperform their FIP more often than righties. However, that's not the whole story. There are two aspects to looking at a difference between two populations - determining whether the difference is real, and determining how big that difference really is. In this case, we already have one measure of the difference - the two means, 0.15 and 0.04 for LHPs and RHPs respectively - but we can do better than that with a measure called Cohen's D. Cohen's D is the standardized difference between the means of the two datasets: the gap between the means divided by their pooled standard deviation. In this case, the value for d is 0.216, which is classified by Cohen as a "small" effect.
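As a check, Cohen's d can also be recomputed from the summary numbers above. This is my own sketch; plugging in the rounded means and standard deviations lands within a rounding step of the article's 0.216:

```python
import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Cohen's d: the difference in means scaled by the pooled
    standard deviation of the two samples."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)

# Article's summary stats (rounded), LHP vs. RHP
d = cohens_d(0.150, 0.469, 280, 0.041, 0.517, 674)
print(d)   # about 0.216-0.217 from these rounded inputs
```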
So it's a real, though small, effect. Why does it exist in the first place?
I used the KS test again, this time looking for significant differences in 43 statistics across FIP under- and over-performers (including total batters faced, which I included as a makeshift control to make sure there wasn't a significance to playing time). The p-values for each are listed at the end of this article, where you can also find the complete list of tested stats. I found 14 stats that had a traditionally significant (p less than 0.05) difference between the over- and under-performers: K/9, WHIP, BABIP, LOB%, GB/FB, LD%, GB%, FB%, IFFB%, K%, changeup velocity, O-swing%, O-contact%, and stolen base attempts. Three more - total batters faced, two-seam%, and contact% - were extremely close to making the cut. I then used the same KS test to look for differences between righties and lefties just among those 17 categories, and I found significant differences in five: LOB%, IFFB%, two-seam%, changeup velocity, and stolen base attempts.
I next looked at the correlations between pitchers' FIP-ERA and those five statistics. Of the five, only one showed any meaningful relationship: LOB% correlates with FIP-ERA at r^2 = 0.543, while none of the others reached an r^2 of 0.02. The bottom line to all this is that I found a small but significant difference between the FIP-ERA of LHPs versus RHPs, and the only thing out of 43 statistics examined that could plausibly explain it is LOB%, which accounts for about 54% of the variance in FIP-ERA.
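For anyone following along, the r^2 figures here are just the squared Pearson correlation between two columns. A minimal stdlib sketch (my own illustration, with made-up toy data rather than the article's dataset):

```python
def r_squared(x, y):
    """Square of the Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx)**2 for a in x)
    vy = sum((b - my)**2 for b in y)
    return cov * cov / (vx * vy)

# Perfectly linear toy data gives r^2 = 1
print(r_squared([1, 2, 3, 4], [2, 4, 6, 8]))   # 1.0
```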
So, now I'm going to take a different tack and see if I can come up with a model that can predict FIP-ERA regardless of handedness. I realize Linear Mixed Models are all the rage in baseball analysis right now, but they're a little beyond my current ability, so I stuck with old-fashioned linear regression. My attempts completely broke down whenever I included any sort of pitch type usage or velocity data, so I left those out. Just using the other stats I had at the ready, I found three highly significant factors - WHIP, LOB%, and K% - and seven others with some level of significance - the fitted intercept, K/9, BB/9, HR/9, LD%, GB%, and FB%. Using the coefficients the regression found, I created two test data points for each pitcher, one using only the highly significant factors and the other using any significant factors, and checked their correlation with the observed FIP-ERA. Although K/9 and K% are highly similar, and a more rigorous examination would likely account for that, I took this to be casual enough that I left them both in the "any significance" version.
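The article's regression was run in R (code at the GitHub link below), but the underlying mechanics are ordinary least squares. Here's a self-contained Python sketch of OLS via the normal equations, on hypothetical toy data rather than the pitcher dataset:

```python
def ols(xs, y):
    """Ordinary least squares with an intercept, via the normal
    equations (X'X) beta = X'y, solved by Gauss-Jordan elimination.
    xs is a list of predictor columns; returns [intercept, b1, b2, ...]."""
    n, k = len(y), len(xs)
    X = [[1.0] + [xs[j][i] for j in range(k)] for i in range(n)]  # design matrix
    p = k + 1
    # Build the augmented system [X'X | X'y]
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(p)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(p)]
    for col in range(p):
        # Partial pivoting, then eliminate this column from all other rows
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(p):
            if r != col and A[r][col]:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]

# Toy data generated exactly by y = 2 + 3*x1 - 1*x2, so the fit
# recovers those coefficients
beta = ols([[0, 1, 2, 3], [1, 0, 1, 0]], [1, 5, 7, 11])
print(beta)   # approximately [2.0, 3.0, -1.0]
```

In practice you'd reach for R's `lm` (as the article did) or a numerical library rather than hand-rolled elimination, but the sketch shows what those fitted coefficients are.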
The "high significance" model was actually worse at predicting FIP-ERA (going by correlation) than simply using LOB%, with an r^2 of only 0.521. However, the "any significance" model was a large improvement, with an r^2 of 0.811. Here's a scatterplot of that dataset, adjusted linearly so that the average prediction is zero:
And here's a chart of the linear model coefficients, including p-values. Anything with a p-value under 0.1 was included in the "any significance" model, while only factors with a p-value under 0.001 were included in the "high significance" one.
| Factor | Estimate | Std. Error | t value | p-value |
| --- | --- | --- | --- | --- |
| LOB_pct | 8.964e+00 | 2.163e-01 | 41.441 | less than 2e-16 |
The analysis in this article was conducted via R; code and data beyond standard R functions (like plot, etc) are available at https://github.com/johnchoiniere/fip_article.
KS test results, LHP v. RHP: table of Statistic, Cohen's D, and p-value for each of the 43 stats tested (several p-values below 2e-16).

KS test results, FIP-ERA>0 v. FIP-ERA<0: table of Statistic, Cohen's D, and p-value for the same 43 stats.
All statistics courtesy of FanGraphs.
John Choiniere is a researcher and (occasional) contributor at Beyond the Box Score. You can follow him on Twitter at @johnchoiniere.