Skill Interactive (SIERA-friendly) WAR
The development of SIERA has done a lot to stir the DIPS pot and got me thinking a great deal about how we value pitchers. I have come to the conclusion that all of the well-known implementations of the Wins Above Replacement framework (Fangraphs, Rally, and StatCorner) leave something to be desired with respect to pitcher evaluation. To make a long story short: Fangraphs' WAR uses FIP as its production metric, which ignores a pitcher's ability to influence batted ball types. Rally's WAR uses RA as its production metric, and even with the total zone adjustment, it isn't rooted in DIPS theory as much as I'd like. StatCorner's WAR uses tRA as its production metric, which treats LD% as a pitcher skill (it is not).
Not that there's anything particularly wrong with this; everyone has their own preferences when it comes to quantifying production. Each of the commonly cited WARs has distinct advantages and subtle disadvantages. However, I prefer for my win-value metrics, especially when used in a predictive context, to be completely skill-interactive. So, it became my goal to create a Wins Above Replacement metric that fits with the SIERA theory of DIPS and whose inputs are only league averages and things the pitcher controls (and playing time, of course). The mostly complete list of things a pitcher will influence is: strikeouts, walks, hit batsmen, ground balls, infield fly balls, and non-infield fly balls. The goal is to build a framework to calculate WAR from only those six things (and playing time).
I always prefer to start with rate stats on a per-PA basis, so the five inputs I chose are K%, BB%, HBP%, GB/FB, and IF/FB%. Along with games and batters faced, we can get every piece of information we need to calculate a SIERA-friendly version of a DIPS-based WAR.
Here is the short version of how I did it. From a pitcher's BF, K%, BB%, HBP%, GB/FB, and IF/FB% and the league average LD%, we can calculate expected strikeouts, walks, hit batsmen, ground balls, line drives, infield fly balls, and non-infield fly balls. Next, using the data that Colin Wyers published here, we can calculate expected singles, doubles, triples, and home runs from the batted ball type totals. In the process, we use the batted ball outs along with strikeouts to calculate innings pitched. The expected outcomes of batted ball types are summarized in this image (taken from the link above):
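As a rough illustration of the first step, here is a minimal Python sketch of turning the rate inputs into expected event counts. The function and field names are made up for this example, and it assumes that all rates are fractions, that the league LD% is expressed per ball in play, and that the FB in GB/FB includes infield flies. The post does not spell out those conventions, so treat them as assumptions. The step from batted ball types to singles, doubles, triples, and home runs is left out because it depends on the published outcome table.

```python
from dataclasses import dataclass


@dataclass
class ExpectedEvents:
    strikeouts: float
    walks: float
    hit_batsmen: float
    ground_balls: float
    line_drives: float
    infield_flies: float
    outfield_flies: float


def expected_events(bf, k_pct, bb_pct, hbp_pct, gb_fb, iffb_pct, lg_ld_pct):
    """Convert per-PA rates into expected counts (hypothetical helper).

    Assumptions: rates are fractions (0.20, not 20), lg_ld_pct is line
    drives per ball in play, and fly balls include infield flies.
    """
    strikeouts = bf * k_pct
    walks = bf * bb_pct
    hit_batsmen = bf * hbp_pct

    # Everything that is not a K, BB, or HBP is treated as a batted ball.
    batted_balls = bf - strikeouts - walks - hit_batsmen

    # Line drives come from the league rate, not from the pitcher.
    line_drives = batted_balls * lg_ld_pct

    # Split the remainder into GB and FB using the pitcher's GB/FB ratio.
    gb_plus_fb = batted_balls - line_drives
    fly_balls = gb_plus_fb / (1.0 + gb_fb)
    ground_balls = gb_plus_fb - fly_balls

    # Split fly balls into infield and non-infield flies.
    infield_flies = fly_balls * iffb_pct
    outfield_flies = fly_balls - infield_flies

    return ExpectedEvents(strikeouts, walks, hit_batsmen, ground_balls,
                          line_drives, infield_flies, outfield_flies)


example = expected_events(bf=800, k_pct=0.20, bb_pct=0.08, hbp_pct=0.01,
                          gb_fb=1.5, iffb_pct=0.10, lg_ld_pct=0.19)
```

Every batter faced ends up in exactly one bucket, so the seven counts always sum back to BF.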
At this point, there are a number of ways to calculate WAR from the expected outcomes. The simplest way I know of is to plug the expected outcomes into a base run estimator, whose calculation I won't insult your intelligence by describing in this space. We then take the base runs total and convert it to a BsR ERA (scaling in the process). This BsR ERA is what we'll use to quantify the production aspect of WAR. Patriot talked a little bit about a pseudo-SIERA using a BsR model here. Mine is much the same, and it comes from the same theoretical place. What I've done is add an extra (and somewhat unnecessary) step, converting everything from rate stats into counting stats and back to a rate stat (BsR ERA). I did this for two reasons: 1) it's a lot easier for me to understand everything that's going on in the equations if I'm doing it this way, rather than plugging rate stats into the run estimator, and 2) it's easier to see why a player over- or under-performed his true talent level if we have the counting stats in front of us. Having a table with a pitcher's expected singles, doubles, homers, et cetera frequently sheds more light than simply observing his BABIP or the like.
After that the calculations are far from simple, but they're nothing we're not familiar with. We adjust the run environment to account for the impact the pitcher we're measuring has on said run environment. We subtract the pitcher's expected BsR ERA from the league average BsR ERA and add the difference to .500. From that number, we subtract replacement level (.380 for starters, .470 for relievers), multiply by innings pitched, divide by nine, and account for chaining if we're dealing with relievers, and that's pretty much it. For a better explanation of the calculations than I could possibly give, refer to Cameron's win value series, specifically Pitcher Win Values Part Seven.
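For readers who would like the estimator spelled out anyway, below is a minimal Python sketch of a basic base runs calculation and the conversion to a per-nine rate. It is not necessarily the exact version used here. The A, C, and D terms follow the common textbook form with hits including home runs, the B coefficients and the 1.1 multiplier are borrowed from the B term given in the comments (without the double-play adjustment), and the `scale` argument is a placeholder for whatever runs-to-earned-runs scaling is applied. All function and parameter names are invented for illustration.

```python
def base_runs(singles, doubles, triples, home_runs, walks, hit_batsmen,
              batters_faced, b_multiplier=1.1):
    """Basic base runs: A * B / (B + C) + D (illustrative sketch)."""
    hits = singles + doubles + triples + home_runs
    total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs

    a = hits + walks + hit_batsmen - home_runs           # baserunners
    b = (1.4 * total_bases - 0.6 * hits - 3 * home_runs
         + 0.1 * (walks + hit_batsmen)) * b_multiplier   # advancement
    c = batters_faced - walks - hit_batsmen - hits       # outs
    d = home_runs                                        # automatic runs
    return a * b / (b + c) + d


def innings_from_outs(strikeouts, batted_ball_outs):
    """Innings pitched from strikeouts plus expected batted ball outs."""
    return (strikeouts + batted_ball_outs) / 3.0


def bsr_era(bsr, innings_pitched, scale=1.0):
    """Per-nine-innings rate; `scale` is an assumed calibration factor."""
    return 9.0 * bsr * scale / innings_pitched


runs = base_runs(singles=100, doubles=30, triples=3, home_runs=15,
                 walks=50, hit_batsmen=5, batters_faced=800)
```

The multiplier on B is a fudge factor, so in practice it would be tuned until league-wide estimated runs match actual runs.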
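The arithmetic in the paragraph above can be sketched roughly as follows. This is a simplified illustration, not the full method: it skips the run environment adjustment and reliever chaining, and it divides the ERA difference by a runs-per-win figure before adding it to .500. That divisor, and its default of 10, is an assumption of this sketch and is not stated in the text. The replacement levels are the ones quoted above, and the function name is hypothetical.

```python
REPLACEMENT_WIN_PCT = {"starter": 0.380, "reliever": 0.470}


def skill_war(pitcher_era, league_era, innings_pitched, role,
              runs_per_win=10.0):
    """Simplified WAR from a BsR ERA (no chaining, no run-environment fix).

    runs_per_win is an assumed conversion, not taken from the post.
    """
    # Runs saved per nine innings, converted to wins per game.
    win_pct = 0.500 + (league_era - pitcher_era) / runs_per_win

    # Wins above replacement per game, times games' worth of innings.
    games = innings_pitched / 9.0
    return (win_pct - REPLACEMENT_WIN_PCT[role]) * games


starter_war = skill_war(pitcher_era=3.30, league_era=4.30,
                        innings_pitched=200, role="starter")
```

With these made-up inputs, a starter one run per nine better than league average over 200 innings comes out a little under five wins above replacement.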
Before I present the results, I want to talk a little bit about what we have here, exactly. Basically, this is a luck-removed, pitcher-skill interactive, base runs ERA estimator (using expected outcomes of batted ball types) based WAR. It's luck-removed because we've assumed league average for things pitchers show little to no ability to control (HR/FB, BABIP on different batted ball types, LD%, et cetera). It's pitcher-skill interactive because the only things we used to build the metric are strictly pitcher skills, skills in which little luck or defense is involved. It's an ERA estimator that uses expected outcomes of batted ball types in conjunction with the base runs equation. The use of base runs also has the added effect of removing the impact of the timing of events on actual outcomes--something pitchers have shown no ability to influence.
Think about it this way--it's a form of tRA with an xBABIP component that adjusts for the fact that pitchers have little influence over how frequently they yield line drives. Or, a SIERA-based WAR, because all SIERA does differently in theory from tRA is adjust for the randomness of line drives. One result of this is that the standard deviation of the xBABIPs is extremely small. Of the 265 starters with at least 10 innings in 2009, the highest xBABIP was .317 and the lowest was .281. The .006 standard deviation is less than a fifth of what we'd expect the standard deviation of actual BABIPs to be in a typical year. The reason for this is that pitchers have little control over their BABIP. GB/FB is the only pitcher skill (other than strikeouts, which I'll get to) that influences BABIP. And with this metric, 85 per cent of BABIP is explained by GB/FB ratio:
In reality, the standard deviation of BABIPs will be much greater, but that doesn't mean the spread of our forecasts should be. BABIP tells us little about how well a pitcher performed, and we've removed most of the useless information from BABIP, which theoretically should result in a lower standard error, or at worst one no worse than what we've currently got. Some of the other positives:
- It considers only pitcher skills. None of the other common WARs do this. In Rally's WAR, some luck and timing is expressed. With Fangraphs' WAR, home run rates are credited to the pitcher, and home run rates are a function of three things: fly ball rate (a pitcher skill), park (not a pitcher skill), and randomness (also not a pitcher skill). StatCorner's WAR credits a pitcher with their line drive percentage, when in fact it's not something they have much--if any--control over.
- It considers all pitcher skills. Fangraphs' WAR doesn't treat extreme ground ball and extreme fly ball pitchers any differently, when in fact a ground ball is always less likely to result in runs than a fly ball and pitchers are able to influence their ground ball to fly ball ratio. Rally's pretty much considers all pitcher skills, though it's the noise otherwise involved with using actual runs that makes Rally's less appealing to me. And tRA considers all of the pitcher skills too. Again, it's the fact that tRA treats line drive percentage like a pitcher skill that makes it not as useful as I'd like, especially when it comes to predictive value.
- I think the theory and methodology can be especially useful for forecasting. I'm sure we're already seeing some of this, but pitcher skills seem like a good place to start, no? In theory, it's the DIPS advocate's dream WAR. Unfortunately, that's about where the good ends.
Some limitations and areas to improve:
- Adjustments. I haven't made nearly enough. Ideally, we'd have two versions, one with generic expected batted ball outcomes like I've used and one with park-specific expected batted ball outcomes (especially useful for forecasting). I don't even have league-specific averages, so I'm obviously not too close to getting everything properly adjusted.
- How do we account for regression when regression is due? This is something I should look to address using forecasting methodology, I think. I'm particularly concerned with how to handle IF/FB%, something pitchers seem to have subtle influence over at best.
- Leverage. Is there a better way to handle the leverage aspect?
- This model does not account for the fact that there's a subtle but not insignificant relationship between BABIP and strikeout rate. Finding a way to account for this would really be a nice feature.
- Data consistency. I'm using Fangraphs' data for all of the inputs because it's the only place I know of (besides Retrosheet) that offers all of these metrics. However, some of the components of the metric aren't perfectly calibrated for Fangraphs' data.
Enough talking about it though, let's get to the leaderboard from 2009, starting with starters (naturally):
And the relievers:
The complete 2009 results (minimum 10 IP) can be downloaded here.
.....................
One thing I realized while doing this is just how good MLB teams are at keeping sub-replacement level players off the field. Here are the Skill Interactive WAR trailers from 2009, starters (minimum 100 IP):
And relievers (minimum 50 IP):
These are just the 2009 totals, and I may be off base by assuming this notion holds true throughout history (or even recent history). But I think the fact that only one starter pitched 100 innings cumulatively below replacement level and only one reliever pitched 50 innings cumulatively below replacement level in 2009 tells us two important things. One is MLB teams are better at evaluating talent than we think. If a player is fundamentally below replacement level, chances are he won't pitch too much. A simple reminder that they're the professionals and they mostly know what they're doing is always healthy. The second thing is the fact that the WAR framework models reality very well. Circular logic, yes, but in all likelihood, these two things are true.
In conclusion, I think it is appropriate to continue to explore the application of SIERA or a SIERA-friendly metric within the WAR framework. WAR is the best model I know of to quantitatively evaluate players, and it only makes sense to try to build a skill-interactive version. This is only the first step in the process (this is a beta version, if you will), but seeing how well it's gone so far, I think it is definitely doable at a useful quality level. I couldn't be happier with the results thus far.
Comments
good read
Nice work sir.
Is this interactive?
In SIERA, several of the variables are interaction terms, like (K%)^2 or (GB)*(BB). The reason why this is a feature of SIERA is that (for example), a high groundball rate pitcher will in theory be able to sustain a higher walk rate, since he will induce more double plays and allow fewer home runs. I think this is great work but I am not sure it has the interaction that SIERA aims for.
Kind of has that half-way-between-tRA-and-SIERA feel to it. “Fixing” tRA and the LD but not quite SIERA. Though, it is late and I may not be thinking clearly.
My old blog is Tigers By The Numbers.
Now I write at Bless You Boys.
Like music? See what I'm listening to at my Last.fm account.
Basically what it is at this point.
I hope to incorporate the other interactive terms at some point. As of right now, it’s only pseudo-skill interactive.
I probably should’ve stuck with my original name, SIERA-friendly WAR.
Of course, we could just use SIERA rather than this base run ERA estimator, but that has its disadvantages from my standpoint, too.
Beyond the Box Score / Capitol Avenue Club / shwitter: @CapitolAvenue
Word. Wasn’t trying to dismiss this. I thought it was really, really interesting.
Patriot
If you foil BsR, you’ll find “interactive” terms all over the place in just about every possible combination. The interactive nature of SIERA is unique only when compared to other regressions, which are quite often pure linear regressions.
Of course SIERA’s specific brand of interactivity is unique, namely in that it only considers certain interactive terms, whereas BsR has an interactive term between almost all possible combinations of categories included in the metric.
Oops, I thought the post title field was where my name went. Now it looks as if I wanted my name in bold.
Anyway, the real question is whether BsR-ized pseudo-SIERA has comparable accuracy to SIERA on SIERA’s terms, whatever those may be. My guess would be that it does, but it’s just that.
The short version is "yes."
At least in the version I tested – I crafted a BsR version for the SIERA inputs and the RMSE was practically indistinguishable.
Colin,
Was the BsR pseudo-SIERA version you crafted calibrated with the same data set you used to test it?
Just ran a regression on pseudo-BsR ERA on SIERA. The results are fairly promising:

Keep in mind neither metric is calibrated for the data set I’m using.
Any reason not to include HR/FB%?
I realize that’s something that deserves high regression and thus will throw off a formula that doesn’t use regression, but if you go the next step, I think it’s worth looking at.
Also, why not use the results of Pizza’s research which shows the y-t-r correlation of various metrics, showing you how much to regress each one? Maybe at that point you could keep multiple years of past data and come up with a pretty damn good projection, too. (Not that this was your original goal.)
by Sky Kalkman on Mar 14, 2010 12:19 PM EDT
My understanding is HR/FB is mostly just Luck/Park
Do you have a link to Pizza’s research?
Another way I thought of to improve it would be to estimate the probability of the 24 base/out states given a pitcher’s 1B, 2B, 3B, HR, BB, HBP, and playing time and recalculate the linear weights based on those probabilities. (I don’t know how we could do this or if we could. Colin? Patriot? Anyone?)
link
Put something together that roughly accounts for the double play advantage ground ballers have.
Introduce a new term “E” into the BsR equation.
E = -0.46*(0.783*(1B+BB))*(0.005*(GB/FB) + 0.07) + 1.21
BsR = [A*B/(B+C)]+D+E
The -0.46 is the linear weights value of a GIDP, the 0.783 roughly scales singles plus walks to GIDP opportunities. The .0005*(GB/FB) + 0.07 is a regression equation, not nearly as good as I’d like I might add, that attempts to estimate expected GIDP conversion percentage as a function of GB/FB ratio. 1.21 adjusts the league average to zero.
Never mind.
I tested this and a HR correction factor based on on-base average and they both totally shot my RMSE.
Anyway, after reading Patriot’s comment, I think the best thing to do between now and next year is to just make sure we’ve got all the data right and get all of the fudge factors calibrated perfectly. Next year, if it works, we can proceed, perhaps adjusting for a few more skill interactions.
If you want to include DPs, I would use an expanded BsR version as my basis, and estimate DPs as a function of GB%*(estimated singles + walks). Which is sort of what you did above in your E factor, except in this case it could be inserted directly into the standard BsR equation in the appropriate factors, while maintaining the properties of BsR.
I readjusted it, ditching the GB/FB part and replacing it with GB/PA. Now, the “B” term should be (if I’m not wrong):
B = [1.4 x TB – 0.6 x H – 3 x HR + 0.1 x (BB+HBP) – 0.9 x (0.783 x (1B+BB) x (0.2 x (GB/PA) + 0.05))] x 1.1
And, again if I’m not wrong, we add 1 out per xGIDP [ 0.783 x (1B+BB) x (0.2 x (GB/PA) + 0.05) ] to the “C” term.
So, my “C” term is:
C = PA – BB – HBP – HR – H + 0.783 x (1B+BB) x (0.2 x (GB/PA) + 0.05)
You might need to adjust the B multiplier to ensure that total estimated runs equals total runs scored once the DP terms are added.
Right.
I’ve been using fudge factors in the B and C terms the entire time to make the metric consistent with actual outs and runs. We’ll see how well it works with next year’s data.
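For anyone following along, the double-play-adjusted B and C terms proposed a few comments up can be written out as a short Python sketch. The function and argument names are invented for this example, the coefficients are copied from those comments as written, and the 1.1 multiplier is left as a parameter because it is a fudge factor that would need recalibrating. The comments do not say whether H includes home runs, so the functions simply take H and HR as given.

```python
def expected_gidp(singles, walks, ground_balls, pa):
    """xGIDP = 0.783 x (1B+BB) x (0.2 x (GB/PA) + 0.05), per the comments."""
    return 0.783 * (singles + walks) * (0.2 * (ground_balls / pa) + 0.05)


def b_term(total_bases, hits, home_runs, walks, hit_batsmen, xgidp,
           multiplier=1.1):
    """Advancement term with 0.9 x xGIDP removed, then scaled."""
    return (1.4 * total_bases - 0.6 * hits - 3 * home_runs
            + 0.1 * (walks + hit_batsmen) - 0.9 * xgidp) * multiplier


def c_term(pa, walks, hit_batsmen, home_runs, hits, xgidp):
    """Outs term as written in the comment, plus one out per xGIDP."""
    return pa - walks - hit_batsmen - home_runs - hits + xgidp


# Made-up inputs, for illustration only.
xgidp = expected_gidp(singles=100, walks=50, ground_balls=360, pa=800)
b = b_term(total_bases=229, hits=148, home_runs=15, walks=50,
           hit_batsmen=5, xgidp=xgidp)
c = c_term(pa=800, walks=50, hit_batsmen=5, home_runs=15, hits=133,
           xgidp=xgidp)
```

Keeping xGIDP as its own function makes it easy to swap in a better estimate of double play conversion later without touching the B and C terms.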