NEW YORK, NY - AUGUST 10: Joel Pineiro #35 of the Los Angeles Angels of Anaheim delivers a pitch against the New York Yankees on August 10, 2011 at Yankee Stadium in the Bronx borough of New York City. (Photo by Mike Stobe/Getty Images)
The idea of creating wRA (weighted Runs Allowed, based on wRC, weighted Runs Created, the linear-weights stat used for evaluating hitters) started after I saw a debate on Twitter about the Ottoneu scoring system. Regardless of how well you do it, the ultimate goal of pitching is to record outs. Obviously, doing it well is better, but even "eating up innings," as the phrase goes, has some value. The argument for solely using batters faced didn't make logical sense to me, as it only punishes the pitcher for facing more batters, even if the pitcher faces the minimum 3 batters per inning.
The scoring system in Ottoneu, for those who don't know, is based on work by Merry, using the FIP (fielding independent pitching) constants and innings pitched. When I thought about the idea of only using batters faced, I had the idea of "What would happen if a pitcher pitched a perfect game, but didn't record a strikeout?" Francisco Liriano's 6 walk, 2 strikeout no-hitter earlier this season comes to mind as a slightly less extreme example of this situation.
In that proposed system, the no-strikeout perfect game would score as a negative game: the pitcher failed to record a single strikeout, and he's penalized for facing 27 batters. Yet because he pitched a perfect game, at no point were his team's chances of winning hurt. A perfect game is, by its very nature, impossible to lose. It may not be the ideal way to go about recording a perfect game, because it's been shown repeatedly that more strikeouts are better (SIERA has done some great work on this front), but this game had significant positive value for his team, and even if the pitcher was exceptionally lucky, why should he be penalized for it?
Yes, Pedro Martinez in 2002 was better than Ryan Rowland-Smith in 2010, but we knew that already. The real question is how much better he was. WAR (Wins above Replacement player) only tells us so much, and is heavily reliant upon playing time (more playing time with good results is better for a team than less playing time with the same relative results). FanGraphs WAR is tied to FIP, and FIP, while reliable, is still a metric that could be improved upon. While trying to solve the problem of how to improve Ottoneu's scoring system, I asked Niv if anybody had ever tried to calculate wRC or wRC+ against pitchers, thinking that if the basic formula behind wRC was good enough to use for hitters in Ottoneu, it might be adaptable as a metric for pitchers as well.
Any stat ending in + or - in sabermetrics is scaled to the league average (and generally park adjusted as well). The average of all wRCs in the league, after being scaled to account for playing time, would be 100, and any deviation from 100 is the percentage better or worse than the league average: a wRC+ of 120 is 20% above average, an 80 is 20% worse than average, and so forth. Applied to a pitcher, this would tell us how good or bad a pitcher was at limiting the total amount of offense against him.
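To make the scaling concrete, here is a minimal sketch of a "plus" index in Python. It deliberately ignores park adjustments, and the function name `plus_index` and the rates in the example are my own illustrative assumptions, not figures from the data discussed here.

```python
def plus_index(player_rate: float, league_rate: float) -> float:
    """Scale a per-PA rate so that the league average equals 100.

    Simplified sketch: no park adjustment is applied.
    """
    if league_rate <= 0:
        raise ValueError("league_rate must be positive")
    return 100.0 * player_rate / league_rate


# Hypothetical per-PA rates, chosen only to illustrate the scale.
league = 0.110
print(round(plus_index(0.132, league)))  # 20% more offense than average
print(round(plus_index(0.088, league)))  # 20% less offense than average
```

For a hitter, the first line would be the good season. For a pitcher, where the rate measures offense allowed, the second line would be.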
Simply put, it would measure how often a pitcher gave up singles, doubles and triples, as well as the walks, strikeouts, hit-by-pitches, and home runs already incorporated into FIP, with the appropriate weight given to every event, compared to the league average pitcher. I knew that wRC already existed, and had been used, tested and confirmed to be a completely reasonable metric, so why not look into the wRC allowed by any given pitcher in a season? That should tell us exactly how good or bad a pitcher's results were.
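The core mechanic of any linear weights stat is a weighted sum of event counts, which can be sketched in a few lines of Python. The weights below are placeholders for illustration only, not the weights used for the numbers in this article, and the stat line is hypothetical. A full wRC-style calculation also involves league-level scaling, which is omitted here.

```python
from typing import Mapping

# Placeholder run values per event. These are NOT the weights used in the
# article; they only illustrate the mechanics. Strikeouts are counted
# under "OUT" in this simplified sketch.
ILLUSTRATIVE_WEIGHTS = {
    "1B": 0.47, "2B": 0.78, "3B": 1.09, "HR": 1.40,
    "BB": 0.33, "HBP": 0.33, "OUT": -0.25,
}


def weighted_runs(events: Mapping[str, int],
                  weights: Mapping[str, float]) -> float:
    """Sum each event count multiplied by its run weight."""
    return sum(weights[name] * count for name, count in events.items())


# Hypothetical stat line allowed by a pitcher.
line_allowed = {"1B": 120, "2B": 35, "3B": 3, "HR": 18,
                "BB": 50, "HBP": 5, "OUT": 600}
print(weighted_runs(line_allowed, ILLUSTRATIVE_WEIGHTS))
```

Passing the weights in as a parameter makes it easy to swap in a different set, such as weights for a specific run environment.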
As far as I knew, this hadn't been done. BP has TAv (True Average, an all inclusive BP proprietary stat that works in the same fashion as wOBA or wRC+) against, but it isn't publicly available, and is harder to explain to someone not involved in the statistics world. On the other hand, wRC+ is incredibly easy to explain, because it can be given in terms of percentage points better or worse than the average. Obviously, the number of hits a pitcher gives up would be a huge factor in this statistic, but over a significant number of innings, the quality of the defense should normalize and BABIP (batting average on balls in play) would regress towards that predicted by a pitcher's batted ball profile and skill-set.
When FIP was first created, it was in response to BABIP fluctuations drastically affecting a pitcher's performance, and with walks, strikeouts, home runs allowed, and hit by pitches remaining fairly constant from season to season for most pitchers, FIP made sense to use as an evaluation metric, because it still accurately separated the best pitchers from the worst pitchers. With the invention of linear weights, we can exactly calculate just how much each event matters, so the noise produced by a good or bad BABIP can be accounted for to some extent.
The stats I've come up with to date are wRA, the net accumulation of weighted Runs Created against a pitcher over a given time period (a season, several seasons, or a career); wRA/PA, or how many weighted runs a pitcher should expect to give up per batter faced; and wRA/9, how many weighted runs a pitcher should expect to give up per 9 innings pitched, which is a linear weights equivalent of ERA. Obviously, this statistic is not meant to be the end-all of pitching statistics, but I see it as a step forward, slowly working to expand our horizons for statistical evaluations of pitchers. This is not to say that FIP shouldn't be used, because it has proven to be a reliable metric, but there's always more than one way to evaluate a player, and none of them are perfect.
wRA is built using the same linear weights that make up wRC. Points are accumulated exactly like wRC, except that in this case a higher total is bad, since a higher wRA means that a pitcher allowed more "aggregate" offense (not necessarily in the form of runs given up, but more total bases and/or baserunners). A high total can come from pitching poorly, such as Brandon Backe in 2008, who accumulated a wRA of 124.12 through 168 2/3 innings, or a 6.622 wRA/9. It can also come from simply racking up lots of innings, such as Roy Halladay in 2003, who accumulated a wRA of 98.19 but did so by virtue of logging 266 innings, for a wRA/9 of 3.322. These examples are given to remind you to keep things in perspective. Don't just look at the wRA allowed; look at wRA/PA, wRA/IP, or wRA/9, and see just how much offense a pitcher is allowing relative to how much he has pitched.
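The rate stats are simple ratios, and the two seasons above can be checked with a short Python sketch. The function names are my own, and the wRA and innings figures are the ones quoted in the text.

```python
def innings(whole: int, extra_outs: int = 0) -> float:
    """Convert baseball innings notation (e.g. 168 2/3) to a decimal."""
    return whole + extra_outs / 3.0


def wra_per_9(wra: float, innings_pitched: float) -> float:
    """Weighted runs allowed per nine innings."""
    return 9.0 * wra / innings_pitched


def wra_per_pa(wra: float, batters_faced: int) -> float:
    """Weighted runs allowed per batter faced."""
    return wra / batters_faced


backe_2008 = wra_per_9(124.12, innings(168, 2))
halladay_2003 = wra_per_9(98.19, innings(266))
print(f"Backe 2008:    {backe_2008:.2f}")
print(f"Halladay 2003: {halladay_2003:.2f}")
```

Both results agree with the quoted wRA/9 figures to within rounding of the wRA totals.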
Due to fluctuations in luck and the order in which hits are given up, one can't necessarily reliably predict ERA using wRA, but it should serve as a reasonable approximation. As an example of how the order of hits can matter, consider two pitchers, Albert and Brendan. Both pitchers pitch complete innings, and do not get pulled mid-inning. In every inning, Albert first gives up a double, walks the following batter, then proceeds to record three consecutive strikeouts. In every inning, Brendan first allows the walk, then the double, then records his three consecutive strikeouts. Both pitchers would have identical results according to wRA, but Brendan is likely going to give up far more actual runs, due to the order in which his events occurred. Luck, order of events, and grouping of events all play an important role, so they cannot be discounted when discussing ERA, and thus when comparing wRA to ERA (or RA, if you prefer to avoid the earned runs vs unearned runs mess).
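The Albert and Brendan example can be sketched as a toy simulation in Python. The baserunning rules here are deliberately crude assumptions of mine: a walk only forces runners, and a double scores runners from second and third, with a flag controlling whether a runner on first also scores.

```python
from collections import Counter


def runs_in_inning(events, runner_scores_from_first_on_double=True):
    """Count runs in one inning under very simple baserunning rules."""
    first = second = third = False
    runs = outs = 0
    for event in events:
        if event == "K":
            outs += 1
            if outs == 3:
                break
        elif event == "BB":
            # A walk only advances runners who are forced.
            if first:
                if second:
                    if third:
                        runs += 1
                    third = True
                second = True
            first = True
        elif event == "2B":
            # Runners on second and third score on a double.
            runs += int(second) + int(third)
            if first and runner_scores_from_first_on_double:
                runs += 1
                third = False
            else:
                third = first
            first, second = False, True
    return runs


albert = ["2B", "BB", "K", "K", "K"]
brendan = ["BB", "2B", "K", "K", "K"]

# Identical event counts mean identical linear weights totals.
print(Counter(albert) == Counter(brendan))
print(runs_in_inning(albert), runs_in_inning(brendan))
```

Under these assumptions Albert strands both runners every inning, while Brendan gives up a run whenever the runner scores from first on the double. Any linear weights total sees the two innings as the same.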
Using the 2010 data, here are a few sample players:
The sample covers the league leader, Felix Hernandez; the 25th percentile R.A. Dickey; the "league average" Joel Pineiro; the 75th percentile Jeremy Bonderman; and, in last place, Ryan Rowland-Smith, who was truly terrible (from 2002 to 2010, this was the 2nd worst season by wRA/PA and wRA/9). He was a full .02 wRA/PA worse than Zach Duke, who finished next to last. A gap of .02 wRA/PA is also approximately the gap between the respective 2010 seasons of Zack Greinke and Vin Mazzaro.
There are some interesting trends to be found in this data, and they fit with the intuitive understanding of how baseball works. If you plot wRA/PA as a function of ground ball rate, there's a noticeable negative trend, so generating more ground balls tends to lower one's wRA/PA. Logically, ground balls rarely turn into extra base hits (which are obviously far more damaging than a single), so this makes sense. As the number of plate appearances a pitcher has in a season increases, the overall wRA/PA tends to decrease. Setting injuries aside, better pitchers tend to pitch more, because teams know who the better pitchers are and allow them to pitch more. The long reliever / spot starter is a long reliever because the 5th starter is assumed by the coaching staff (sometimes incorrectly, but typically the coaches are correct in this assessment) to be better. As line drive rate goes up, wRA/PA goes up. Line drives drop for hits far more often than ground balls and fly balls, and often result in extra base hits, so a higher line drive rate would therefore be worse. All of this data makes logical sense, and if the data matches the logical conclusions, then usually the idea is a valid one.
However, there's also a strong correlation between BABIP and wRA/PA, which also makes sense. The more hits a pitcher allows, the worse his results are, so this obviously isn't perfect, but every measure can be, and should be, improved on. Generally speaking, pitchers who give up more ground balls are better, because while they tend to give up more hits overall than fly ball pitchers, those hits are often singles, whereas fly balls that fall for hits typically turn into doubles, triples and home runs. So, the age-old question of how exactly to account for this remains unanswered. Linear weights offers the best solution that I currently know of for evaluating hitters, so I decided to extend that analysis to pitchers. Better weightings, better methods of calculation, and better record keeping are all possible. These things are improvable, so I don't think that this field (or any other) has been explored to its fullest. No statistical evaluation tool will ever be perfect, but the pursuit of perfection is still a worthwhile goal, because a more accurate metric always exists. While every player might be a sample size of one, the aggregate of all those individuals is still an incredibly powerful tool for evaluation.
All raw data used was obtained from Baseball Prospectus, and is used with permission. The linear weights used are the 1974 to 1990 weights from TangoTiger, and are used with permission.