The development of SIERA has done a lot to stir the DIPS pot, and it got me thinking a great deal about how we value pitchers. I have come to the conclusion that all of the well-known implementations of the Wins Above Replacement framework (Fangraphs, Rally, and StatCorner) leave something to be desired when it comes to pitcher evaluation. To make a long story short: Fangraphs' WAR uses FIP as its production metric, which ignores a pitcher's ability to influence batted ball types; Rally's WAR uses RA as its production metric, and even with the Total Zone adjustment, it isn't rooted in DIPS theory as much as I'd like; and StatCorner's WAR uses tRA as its production metric, which treats LD% as a pitcher skill (it is not).
Not that there's anything particularly wrong with this; everyone has their own preferences when it comes to quantifying production. Each of the commonly cited WARs has distinct advantages and subtle disadvantages. However, I prefer my win-value metrics, especially when used in a predictive context, to be completely skill-interactive. So it became my goal to create a Wins Above Replacement metric that fits with the SIERA theory of DIPS and whose inputs are only league averages and things the pitcher controls (and playing time, of course). The mostly complete list of things a pitcher will influence is: strikeouts, walks, hit batsmen, ground balls, infield fly balls, and non-infield fly balls. The goal is to build a framework that calculates WAR from only those six things (and playing time).
I always prefer to start with rate stats on a per-PA basis, so the five inputs I chose are K%, BB%, HBP%, GB/FB, and IF/FB%. Along with games and batters faced, that gives us every piece of information we need to calculate a SIERA-friendly version of a DIPS-based WAR.
Here's the short version of how I did it. From a pitcher's BF, K%, BB%, HBP%, GB/FB, and IF/FB%, plus the league-average LD%, we can calculate expected strikeouts, walks, hit batsmen, ground balls, line drives, infield fly balls, and non-infield fly balls. Next, using the data that Colin Wyers published here, we can calculate expected singles, doubles, triples, and home runs from the batted ball type totals. In the process, we use the batted-ball outs along with strikeouts to calculate innings pitched. The expected outcomes of the batted ball types are summarized in this image (taken from the link above):
At this point, there are a number of ways to calculate WAR from the expected outcomes. The simplest way I know of is to plug the expected outcomes into a Base Runs estimator, whose calculation I won't insult your intelligence by describing in this space. We then take the Base Runs total and convert it to a BsR ERA (scaling in the process). This BsR ERA is what we'll use to quantify the production aspect of WAR. Patriot talked a little bit about a pseudo-SIERA built on a BsR model here. Mine is much the same, and it comes from the same theoretical place. What I've done is add an extra (and somewhat unnecessary) step, converting everything from rate stats into counting stats and back to a rate stat (BsR ERA). I did this for two reasons: 1) it's a lot easier for me to understand everything that's going on in the equations this way, rather than plugging rate stats into the run estimator, and 2) it's easier to see why a player over- or under-performed his true talent level if we have the counting stats in front of us. Having a table with a pitcher's expected singles, doubles, homers, et cetera frequently sheds more light than simply observing his BABIP or the like.
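Since the Base Runs step carries the whole production side, here is one common parameterization of it applied to the expected counting stats. The coefficients below are the widely circulated ones, not necessarily the exact weights used for the results in this post, and the 0.92 ERA scaling factor is purely illustrative:

```python
def base_runs(s, d, t, hr, bb, hbp, outs):
    """One common Base Runs parameterization: BsR = A*B/(B+C) + D."""
    h = s + d + t + hr
    tb = s + 2*d + 3*t + 4*hr
    a = h + bb + hbp - hr                                 # baserunners
    b = (1.4*tb - 0.6*h - 3*hr + 0.1*(bb + hbp)) * 1.02   # advancement
    c = outs                                              # outs
    return a * b / (b + c) + hr                           # HR score themselves

def bsr_era(bsr, ip, scale=0.92):
    """Scale a Base Runs total to an ERA-like rate (scale is illustrative)."""
    return bsr / ip * 9 * scale
```

A plausible expected line (120 singles, 30 doubles, 3 triples, 18 homers, 56 walks, 7 HBP, 540 outs over 180 innings) comes out to roughly four runs allowed per nine before scaling, which is the kind of sanity check that made the counting-stat detour worthwhile for me.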
After that, the calculations are far from simple, but they're nothing we're not familiar with. We adjust the run environment to account for the impact of the pitcher we're measuring on that run environment. We subtract the pitcher's expected BsR ERA from the league-average BsR ERA and add the difference to .500. From that number, we subtract replacement level (.380 for starters, .470 for relievers), multiply by innings pitched, divide by nine, and account for chaining if we're dealing with relievers, and that's pretty much it. For a better explanation of the calculations than I could possibly give, refer to Cameron's win value series, specifically Pitcher Win Values Part Seven.
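The wins conversion can be sketched as follows. I've inserted a flat runs-per-win divisor to keep the units honest; Cameron's series derives a dynamic, pitcher-specific figure for that step, and reliever chaining is omitted here entirely:

```python
def skill_war(p_bsr_era, lg_bsr_era, ip, role="SP", runs_per_win=10.0):
    """Sketch of the BsR-ERA-to-wins conversion described above.

    A flat runs_per_win is an assumption made for clarity; Cameron's win
    value series derives a dynamic, pitcher-specific figure instead, and
    chaining for relievers is not modeled here.
    """
    win_pct = 0.500 + (lg_bsr_era - p_bsr_era) / runs_per_win
    replacement = 0.380 if role == "SP" else 0.470  # per the text above
    return (win_pct - replacement) * ip / 9
```

So a starter with a 3.70 BsR ERA against a 4.40 league average over 180 innings grades out as a .570 pitcher, and (.570 - .380) x 180 / 9 gives 3.8 wins above replacement.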
Before I present the results, I want to talk a little about what we have here, exactly. Basically, this is a luck-removed, pitcher-skill-interactive WAR built on a Base Runs ERA estimator that uses the expected outcomes of batted ball types. It's luck-removed because we've assumed league average for the things pitchers show little to no ability to control (HR/FB, BABIP on different batted ball types, LD%, et cetera). It's pitcher-skill interactive because the only things we used to build the metric are strictly pitcher skills, skills in which little luck or defense is involved. And it's an ERA estimator that uses the expected outcomes of batted ball types in conjunction with the Base Runs equation. The use of Base Runs also has the added effect of removing the impact of the timing of events on actual outcomes, something pitchers have shown no ability to influence.
Think about it this way: it's a form of tRA with an xBABIP component that adjusts for the fact that pitchers have little influence over how frequently they yield line drives. Or a SIERA-based WAR, because all SIERA does differently from tRA, in theory, is adjust for the randomness of line drives. One result of this is that the standard deviation of the xBABIPs is extremely small. Of the 265 starters with at least 10 innings in 2009, the highest xBABIP was .317 and the lowest was .281. That .006 standard deviation isn't a fifth of the size we'd expect the standard deviation of actual BABIPs to be in a typical year. The reason is that, in this model, pitchers have little control over their BABIP. GB/FB is the only pitcher skill (other than strikeouts, which I'll get to) that influences BABIP, and with this metric, 85 percent of BABIP is explained by GB/FB ratio:
In reality, the standard deviation of BABIPs will be much greater, but that doesn't mean our forecast's should be. BABIP tells us little about how well a pitcher performed, and we've removed most of the useless information from it, which theoretically should result in a lower standard error, or at worst one no worse than what we've currently got. Some of the other positives:
- It considers only pitcher skills. None of the other common WARs does this. In Rally's WAR, some luck and timing is expressed. With Fangraphs' WAR, home run rates are credited to the pitcher, and home run rates are a function of three things: fly ball rate (a pitcher skill), park (not a pitcher skill), and randomness (also not a pitcher skill). StatCorner's WAR credits a pitcher with his line drive percentage, when in fact it's not something he has much, if any, control over.
- It considers all pitcher skills. Fangraphs' WAR doesn't treat extreme ground ball and extreme fly ball pitchers any differently, when in fact a ground ball is always less likely to result in runs than a fly ball, and pitchers are able to influence their ground ball to fly ball ratio. Rally's WAR pretty much considers all pitcher skills, though the noise involved with using actual runs makes it less appealing to me. And tRA considers all of the pitcher skills too; again, it's the fact that tRA treats line drive percentage like a pitcher skill that makes it less useful than I'd like, especially when it comes to predictive value.
- I think the theory and methodology can be especially useful for forecasting. I'm sure we're already seeing some of this, but pitcher skills seem like a good place to start, no? In theory, it's the DIPS advocate's dream WAR. Unfortunately, that's about where the good ends.
Some limitations and areas to improve:
- Adjustments. I haven't made nearly enough. Ideally, we'd have two versions, one with generic expected batted ball outcomes like I've used and one with park-specific expected batted ball outcomes (especially useful for forecasting). I don't even have league-specific averages, so I'm obviously not too close to getting everything properly adjusted.
- How do we account for regression when regression is due? This is something I should look to address using forecasting methodology, I think. I'm particularly concerned with how to handle IF/FB%, something pitchers seem to have subtle influence over at best.
- Leverage. Is there a better way to handle the leverage aspect?
- This model does not account for the fact that there's a subtle but not insignificant relationship between BABIP and strikeout rate. Finding a way to account for this would really be a nice feature.
- Data consistency. I'm using Fangraphs' data for all of the inputs because it's the only place I know of (besides Retrosheet) that offers all of these metrics. However, some of the components of the metric aren't perfectly calibrated for Fangraphs' data.
Enough talking about it though, let's get to the leaderboard from 2009, starting with starters (naturally):
And the relievers:
The complete 2009 results (minimum 10 IP) can be downloaded here.
One thing I realized while doing this is just how good MLB teams are at keeping sub-replacement level players off the field. Here are the Skill Interactive WAR trailers from 2009, starters (minimum 100 IP):
And relievers (minimum 50 IP):
These are just the 2009 totals, and I may be off base in assuming this notion holds true throughout history (or even recent history). But the fact that only one starter pitched 100 cumulative innings below replacement level in 2009, and only one reliever pitched 50, tells us two important things. One is that MLB teams are better at evaluating talent than we give them credit for. If a player is fundamentally below replacement level, chances are he won't pitch much. A simple reminder that they're the professionals and they mostly know what they're doing is always healthy. The other is that the WAR framework models reality very well. Circular logic, yes, but in all likelihood, both things are true.
In conclusion, I think it is appropriate to continue to explore the application of SIERA or a SIERA-friendly metric within the WAR framework. WAR is the best model I know of to quantitatively evaluate players, and it only makes sense to try to build a skill-interactive version. This is only the first step in the process (this is a beta version, if you will), but seeing how well it's gone so far, I think it is definitely doable at a useful quality level. I couldn't be happier with the results thus far.