Opening note: If you're not familiar with the Wins Above Replacement framework, or the three places to find WAR(P) metrics, I'd strongly recommend that you first read this article by Steve Slowinski over at FanGraphs. Then come on back, and enjoy this article. Thanks!
A few weeks ago, I raised an item for discussion that I'd been considering for a long time. The sabermetric community has three major, oft-cited metrics for determining wins above replacement, but the three metrics are all set to slightly different scales. We've got FanGraphs's wins above replacement (commonly abbreviated to "fWAR"), Baseball-Reference's wins above replacement (commonly abbreviated to "rWAR" or "bWAR" -- I default to rWAR), and Baseball Prospectus's wins above replacement player (commonly abbreviated to "WARP").
I see an issue with the three metrics -- these frameworks all look the same, and even share much of the same name, but they're really not. Each sets replacement level in a different place, each uses different inputs, and their numbers cannot be compared apples-to-apples when used together, because the underlying idea of "what a win is" differs between the methodologies.
The best way to explain this might be through an example. Take a look at Kendrys Morales and how the three WAR metrics value his 2012 contributions to the Angels:
From one perspective, that looks like all three systems value his contributions equally. All three systems are saying that Morales was worth between 1.8 and 1.9 wins more than a replacement player would have provided, given the circumstances.
But FanGraphs has a different replacement level than Baseball-Reference, and both of those methodologies have a different level than Baseball Prospectus. In 2012, FanGraphs determined that the overall population of position players was worth 691 WAR. Baseball-Reference said that the overall population was worth a little over 517 wins above replacement. Finally, Baseball Prospectus had the number closer to 433 wins above replacement player. This is actually a pretty big difference: FanGraphs handed out roughly 60 percent more wins to position players than Baseball Prospectus did.
That means that the three systems actually don't judge the player the same way. Each win via fWAR is, by one measure, a less valuable win than one via rWAR or WARP, since FanGraphs hands out a larger pool of wins. By the same reasoning, each win via WARP is more valuable than a win via rWAR or fWAR, since Baseball Prospectus hands out the smallest pool. So to compare scores between systems in a way that's closer to apples to apples, we need to scale each of these metrics to a common baseline.
So, guess what? I finally did that. Hooray!
Now, what we can do is twofold. First, and perhaps most importantly, we can compare WAR metrics across different systems, in order to see which players and performances exist as extreme outliers. If one system rates a player much differently than the other two systems, we can more clearly see how big the difference is, while operating on the same scale.
The other thing we can do, which some folks think is of questionable value, is average the disparate WAR(P) values into a single number. I call this number the WAR Index (WARi), and I believe it's valuable because it gives a snapshot of how the entire existing saber community views a single player's season or career. While many people neck-deep in objective analysis prefer one form of WAR(P) or another, many casual fans or people new to sabermetrics may just use whatever they are presented with.
First, this article is here to tell you how I did it, and to open up for discussion any questions about my methodology. After this, we'll talk a little about the (qualified) position players from the 2012 season, and how they look via the WAR Index. And soon, we'll get to pitchers as well as other seasons and careers. Get pumped!
Okay, so as I stated before, we've got our different WAR(P)s, used with different scales. What we need to do is scale them to one another, so that they sit on a more even playing field. What I chose to do was based on a methodology described by (you guessed it) Tom Tango over at The Book Blog. Here's what he said:
Let’s say for example that Fangraphs has 486 pitcher wins and Baseball Reference has 400 pitcher wins. (I don’t know what it is exactly, so this is just an illustration.) And let’s say there’s 43,000 innings. Fangraphs has given out 86 more wins. And they did this based on 43,000 more innings. 86/43,000 is .002. Therefore, you simply need to add .002 wins for each inning, in order to align both to the same replacement level. A pitcher with 200 innings therefore would get 0.4 more wins added to his rWAR numbers to align it to the Fangraphs’ win number.
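Tango's arithmetic is easy to check in a few lines of Python. This is just a sketch using his illustrative (not real) totals, and the function name is my own invention for this example.

```python
def per_unit_adjustment(target_total, source_total, playing_time):
    """Wins to add per unit of playing time to move the source system
    onto the target system's replacement level."""
    return (target_total - source_total) / playing_time


# Tango's illustrative numbers, not real league totals
fangraphs_pitcher_wins = 486
bref_pitcher_wins = 400
league_innings = 43_000

# Extra wins FanGraphs hands out per inning, relative to Baseball-Reference
adj_per_inning = per_unit_adjustment(
    fangraphs_pitcher_wins, bref_pitcher_wins, league_innings
)

# A 200-inning pitcher's rWAR moves up by this much on the FanGraphs scale
bump_for_200_ip = adj_per_inning * 200

print(round(adj_per_inning, 4), round(bump_for_200_ip, 2))  # 0.002 0.4
```

The sign of the result tells you which direction the adjustment goes: positive when the target system hands out more wins, negative when it hands out fewer.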
Great. I used this methodology for pitchers, except I replaced the fake numbers in his example with the real ones. Then I did the same thing for hitters, except I replaced innings pitched with plate appearances. For this particular article, by the way, I'm only focusing on hitters. For more on pitchers, check back at a later date.
Oh, and here's an important thing to note: I chose Baseball-Reference's rWAR as my baseline. Not only does total rWAR sit in between total fWAR and total WARP (most of the time) on a yearly basis, but it also has a pretty easy-to-use scale for judging how a particular number of WAR relates to a player's performance. ~0 WAR means that you're replacement-level. ~2 WAR means that you're starter-caliber. ~5 WAR means that you're an All-Star. ~8 WAR means that you're an MVP candidate. It's simple and clear, though it is, at best, a guideline, not a rule.
So I took the total rWAR for position players, subtracted the total fWAR (and, separately, the total WARP), and then divided each difference by the total number of plate appearances in MLB over the season. I got the two numbers listed below.
2012 fWAR to rWAR Conversion: -0.00094 fWAR/PA
2012 WARP to rWAR Conversion: 0.00046 WARP/PA
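Here is a rough sketch of how those two conversion numbers can be derived, reading the step above as "baseline total minus the other system's total, divided by league plate appearances." The WAR(P) totals are the rounded 2012 figures quoted earlier. The league plate appearance total is NOT given in this article, so the 184,000 below is my own approximate assumption, and the function name is hypothetical.

```python
def per_pa_adjustment(baseline_total, other_total, league_pa):
    """Wins to add per plate appearance to move another system's
    WAR onto the baseline system's scale."""
    return (baseline_total - other_total) / league_pa


# Rounded 2012 position-player totals quoted in this article
RWAR_TOTAL = 517   # Baseball-Reference ("a little over 517")
FWAR_TOTAL = 691   # FanGraphs
WARP_TOTAL = 433   # Baseball Prospectus ("closer to 433")

# ASSUMPTION: approximate 2012 MLB plate appearances, not from the article
LEAGUE_PA = 184_000

fwar_to_rwar = per_pa_adjustment(RWAR_TOTAL, FWAR_TOTAL, LEAGUE_PA)
warp_to_rwar = per_pa_adjustment(RWAR_TOTAL, WARP_TOTAL, LEAGUE_PA)

print(f"fWAR to rWAR: {fwar_to_rwar:.5f} per PA")
print(f"WARP to rWAR: {warp_to_rwar:.5f} per PA")
```

Because the inputs here are rounded, the results land near the conversions listed above rather than matching them to the last digit. The signs are the important part: negative for fWAR (more total wins than rWAR) and positive for WARP (fewer total wins than rWAR).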
So, to convert a player's fWAR to the rWAR scale, I multiply that first number by his total number of plate appearances, then add the result to his fWAR. Because the fWAR conversion is negative, this lowers the number. I can then do the same for WARP with the second number, putting all three numbers on similar scales.
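As a quick sketch of that conversion, here it is applied to a made-up player. The two per-PA constants are the 2012 conversions listed above. The player's 600 plate appearances, 4.0 fWAR and 3.0 WARP are hypothetical values chosen only to show the arithmetic, and the function name is mine.

```python
# 2012 conversions from this article (wins per plate appearance)
FWAR_TO_RWAR_PER_PA = -0.00094
WARP_TO_RWAR_PER_PA = 0.00046


def to_rwar_scale(raw_war, plate_appearances, per_pa_adjustment):
    """Shift a raw WAR(P) value onto the rWAR scale."""
    return raw_war + per_pa_adjustment * plate_appearances


# Hypothetical player line, not real data
plate_appearances = 600
raw_fwar = 4.0
raw_warp = 3.0

adj_fwar = to_rwar_scale(raw_fwar, plate_appearances, FWAR_TO_RWAR_PER_PA)
adj_warp = to_rwar_scale(raw_warp, plate_appearances, WARP_TO_RWAR_PER_PA)

print(round(adj_fwar, 2), round(adj_warp, 2))
```

For this imaginary player, 600 plate appearances takes about 0.56 wins off his fWAR and adds about 0.28 wins to his WARP, which shrinks a one-win gap between the two systems to under two-tenths of a win.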
As an example of the difference this makes from the raw data, let's look at a couple of tables below.
This is what the WAR scores from these three players look like before being adjusted to my scales. Since FanGraphs had a greater number of total fWAR among all position players than there was for rWAR, we apply a negative fWAR/PA adjustment. Since BP had a smaller number of total WARP among all position players than there was for rWAR, we apply a positive WARP/PA adjustment. The end result looks like this:
| Name | fWAR (adj.) | rWAR | WARP (adj.) |
As you can see, in some cases the disparity gets smaller. Goldschmidt is a great example: the raw values look quite a bit different at first, but once you scale fWAR and WARP to rWAR, all three systems value him nearly the same. In Ian Desmond's case, the differences between the systems still exist, and in a very large way, but they're not *quite* as pronounced as they once were. And in Mike Trout's case, scaling brings fWAR's valuation of Trout in line with WARP's, but leaves rWAR as a bit of an outlier.
One last thing that I need to bring up is that calculating a player's adjusted WARs and WAR Index over a career is kind of a painful process. Why? Because, unfortunately, the adjustments per plate appearance vary from season to season. The adjustment from fWAR to rWAR may be exactly the same for 2012 and 2011 right now, but it is different for 2010, and for almost every prior year that I've looked at. And while total WARP for hitters may be smaller than total rWAR for 2012 and 2011, that wasn't always the case. The total WARP for hitters in 2010 was actually more than total rWAR, so the adjustment for that year has to be negative instead of positive. So it's a labor-intensive, but somewhat rewarding, process to do this over several years, as each year has (two) different adjustments that need to be used to calculate WARi. It can take some time.
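To give a feel for the bookkeeping, here is a minimal sketch of a career calculation with a separate pair of adjustments for each season. Only the 2012 pair and the 2011 fWAR value (said above to match 2012) come from this article. Every other adjustment in the table is a placeholder I made up to show the shape of the data, and the career line is hypothetical too.

```python
# season -> (fWAR-to-rWAR adjustment per PA, WARP-to-rWAR adjustment per PA)
ADJUSTMENTS_PER_PA = {
    2012: (-0.00094, 0.00046),   # both values from this article
    2011: (-0.00094, 0.00040),   # fWAR value per this article; WARP is a PLACEHOLDER
    2010: (-0.00080, -0.00020),  # PLACEHOLDERS; WARP is negative because
                                 # total WARP exceeded total rWAR that year
}


def career_on_rwar_scale(seasons):
    """Sum season lines after applying each season's own adjustments."""
    totals = {"fwar_adj": 0.0, "rwar": 0.0, "warp_adj": 0.0}
    for s in seasons:
        fwar_per_pa, warp_per_pa = ADJUSTMENTS_PER_PA[s["year"]]
        totals["fwar_adj"] += s["fwar"] + fwar_per_pa * s["pa"]
        totals["rwar"] += s["rwar"]  # rWAR is the baseline, so no adjustment
        totals["warp_adj"] += s["warp"] + warp_per_pa * s["pa"]
    return totals


# Hypothetical three-season career, not a real player
hypothetical_career = [
    {"year": 2010, "pa": 500, "fwar": 2.0, "rwar": 1.8, "warp": 2.2},
    {"year": 2011, "pa": 600, "fwar": 3.0, "rwar": 2.5, "warp": 2.0},
    {"year": 2012, "pa": 650, "fwar": 4.0, "rwar": 3.5, "warp": 3.0},
]

career = career_on_rwar_scale(hypothetical_career)
print({name: round(value, 2) for name, value in career.items()})
```

The point of keying the table by season is that a year missing from it raises a `KeyError` instead of quietly reusing another year's numbers.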
Nevertheless, with these adjustments, we finally have a (sort of) equal baseline that we can use to (1) average these three replacement-level measures together and (2) determine where the biggest deltas between the systems are. While it's not a perfect system, it works for what I'm trying to do, which is to identify major differences in valuation, and to start to build an overview of how these three systems jointly value a player.
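Both uses can be sketched in a few lines. I'm assuming here that the WAR Index is a plain, unweighted mean of the three values once they're on the rWAR scale, and that the "delta" is the gap between the highest and lowest of them. The function names and the sample values are hypothetical.

```python
from statistics import mean


def war_index(adj_fwar, rwar, adj_warp):
    """WARi as a simple unweighted average (an assumption of this sketch)."""
    return mean([adj_fwar, rwar, adj_warp])


def biggest_gap(adj_fwar, rwar, adj_warp):
    """Return the highest system, the lowest system, and the gap between them."""
    values = {"fWAR (adj.)": adj_fwar, "rWAR": rwar, "WARP (adj.)": adj_warp}
    high = max(values, key=values.get)
    low = min(values, key=values.get)
    return high, low, values[high] - values[low]


# Hypothetical adjusted values for one player-season
wari = war_index(3.4, 3.0, 3.2)
high, low, gap = biggest_gap(3.4, 3.0, 3.2)

print(round(wari, 2), high, low, round(gap, 2))
```

Sorting a list of players by that gap is one simple way to surface the seasons where the three systems disagree the most.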
Excited? At least moderately intrigued? I hope so. Later today, I'll share the qualified hitters for 2012, and show you how they stack up in terms of WAR Index, and where the biggest differences in valuation come from between the WAR systems. Stick around!