This article is based on the research that I presented this past Saturday at the SABR Analytics Conference.
Let's play America's Favorite Game. No, not "Name That Molina," but the ever-popular "Player A or Player B." I'm going to give you two players and their slash lines, and you choose your option.
Player A put up some solid numbers, slashing .303/.435/.582. However, Player B was absolutely raking with a line of .361/.424/.652. I mean, these are both some pretty impressive lines. Okay, now that you have that information, it's time for you to participate. Who do you take?
Option 1: Player A
Option 2: Player B
Oh yeah, there's one other option I didn't tell you about...
Option 3: You haven't given us enough information and we know you're trying to trick us, so we'll go with this option instead.
By the way, everyone should take option 3.
So who is Player A and who is Player B? Player A is a prospect that pretty much every reader of Beyond the Box Score would know: Joc Pederson. He put up his numbers in 553 plate appearances at Albuquerque in the PCL. Player B is someone far fewer people would know: Bobby Bradley. A third-round pick for the Indians in 2014, this first baseman put up his numbers in 176 plate appearances in the Rookie leagues.
Now the point of this isn't to pump up Pederson, nor is it to knock down Bradley. So what is the point? The point is that statistics need context in order to be fully understood. Further, to put these statistics into context, they often need to be adjusted in order to put things on a more context-neutral basis. This isn't an unknown fact in the sabermetric community, as there are many examples of statistics that see some form of context-neutralization. Some of these include wRC+, OPS+, FIP-, and so on. The most common factors that receive attention in the form of adjustment include adjusting for park (think wRC+), era (OPS+, FIP-), or even position (positional adjustments for WAR).
But what about opponents? Rarely do we see statistics adjusted for the opponents that a batter faces. It may be assumed that over a 162-game season opponent effects get washed out. However, I consider it important to adjust these offensive stats to account for opponent, similar in mindset to what I did with RA9 and FIP.
That's the idea behind BXwOBA (short for Bayesian Expected wOBA). It gives us a defense-adjusted wOBA statistic, reached through a Bayesian hierarchical model. Through the rest of this article, we'll go through the rationale behind working with adjusted and expected wOBA, then the methodology, and finally a case study covering 2013 and 2014.
wOBA: A Two-Paragraph Primer
The offensive statistic that we'll be working with here is wOBA, or weighted on-base average. If you frequent the pages of Beyond the Box Score, FanGraphs, Baseball Prospectus, and so on, you'll know what wOBA is very well. For those who don't know, wOBA was created by noted sabermetrician Tom Tango, and is essentially a weighted sum of the majority of the batting components of offense. These components can be divided up into two groups: the batter vs. pitcher (FIP) components (Walks, strikeouts, hit batters, and home runs) and the batter vs. pitcher and defense (Balls in Play) components (In-play outs, singles, doubles, and triples). For the rest of this piece, we'll be concentrating on the balls in play wOBA components.
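To make the "weighted sum" idea concrete, here is a minimal sketch of a wOBA-style calculation. The weights are approximate 2013-era linear-weight values (FanGraphs re-derives them each season), and the denominator skips the intentional-walk adjustment in the official formula, so treat this as illustrative rather than definitive.

```python
def woba(bb, hbp, singles, doubles, triples, hr, ab, sf):
    """Illustrative wOBA: a weighted sum of offensive events.

    Weights are approximate linear-weight values; the official
    formula also subtracts intentional walks from the denominator.
    """
    num = (0.69 * bb + 0.72 * hbp + 0.89 * singles
           + 1.27 * doubles + 1.62 * triples + 2.10 * hr)
    return num / (ab + bb + hbp + sf)
```

For example, a hitter with 100 singles, 30 doubles, 5 triples, 20 home runs, 50 walks, and 5 hit-by-pitches in 500 at-bats plus 5 sacrifice flies comes out around .384 under these weights.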
wOBA is a results-based statistic, in that it takes the results that happened and creates a statistic out of them. As we'll see later on, though, we can't necessarily trust the results that we see. Rather, we'll be looking at calculating an expected wOBA and simultaneously adjusting it for the defenses faced.
The What and Why of Expected and Adjusted
I keep throwing around the terms Expected wOBA and Adjusted wOBA (or in some cases, Expected/Adjusted wOBA). It's probably about time to define what I mean by each of those terms.
When I talk about the expectation of an event, I’m talking about what should have happened, not what did. For example, remember this play from 2013?
What did happen was that Manny Machado made an incredible (and incredibly lucky) play. What the expectation is interested in is what should have happened on that chopper down the line, regardless of whether Manny Machado was there to make the play.
Adjustment comes down to considering baseball’s unbalanced nature. It’s designed to attempt to account for the fact that over the course of a season, a batter will not face the same pitchers or defenses as another player, even one on the same team.
Overall, all this boils down to not relying solely on what did happen as a result of the play. Why can’t we rely on these results--isn’t seeing believing? The reason for not trusting the results is identical to the reason for why we need to use the expectation. Simply put, this reason comes down to sample size.
Baseball analyses can pose the unique problem that a sample size of 600 may not be large enough. As someone who comes from a statistics background analyzing omics datasets (where sample sizes are more often on the order of tens), there are many instances where I would love to have a sample size of 600. But we know that baseball is unique in that sense. Why is that the case? Because in baseball, a player's convergence to his true talent level is slow. How slow (I hear you ask)? It's so slow that if you consider a player with a league average profile on balls in play in terms of out percentage, singles percentage, etc., and look at the variability for a sample size of 300 balls in play (the average number for a qualified player), the difference between his best plausible and worst plausible seasons is the difference between the wOBAs of Jose Abreu and Ben Revere. Obviously, these are two drastically different players. It's hoped that by working with the expectation of each ball in play we'll be able to lessen the variability and get toward a player's true talent level faster.
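To make that variability claim concrete, here is a small simulation sketch. The ball-in-play event probabilities and linear weights below are rough assumptions for a league-average hitter, not measured league rates.

```python
import numpy as np

# Simulate how much a hitter's wOBA on balls in play can swing over
# ~300 BIP purely from sampling noise. Probabilities and weights are
# illustrative assumptions.
rng = np.random.default_rng(0)
probs = [0.70, 0.215, 0.07, 0.015]     # out, single, double, triple
weights = np.array([0.0, 0.89, 1.27, 1.62])  # approximate linear weights
n_bip, n_seasons = 300, 10_000

# Each row is one simulated "season" of 300 balls in play
counts = rng.multinomial(n_bip, probs, size=n_seasons)
woba_bip = counts @ weights / n_bip    # wOBA on balls in play, per season

lo, hi = np.percentile(woba_bip, [2.5, 97.5])
print(f"95% interval of BIP wOBA: {lo:.3f} to {hi:.3f}")
```

Even with an identical true-talent profile in every simulated season, the spread between the best and worst plausible outcomes is on the order of a hundred wOBA points, which is the point of the Abreu/Revere comparison above.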
The rationale behind adjusting for different opponents follows a slightly different track. As we know from discussions about divisional strength and the wild card, baseball does not play a balanced schedule. For that matter, even if baseball did, batters would face different pitchers and different defenses over the course of the season. When reporting wOBA, it should be considered whether or not a player has to face, say, the Royals outfield or the Rockies infield an extra 12 times a year.
Since we're adjusting for defense, it's important to consider how defense can affect wOBA. The first and most obvious way is through plays made or not made. Make the play, wOBA goes down. Don't make the play, wOBA goes up. But you don't have to make plays like Lorenzo Cain or play defense like Lucy van Pelt to affect wOBA. It can be something more subtle, such as Yasiel Puig's arm holding a runner to a single. These subtle things need to be accounted for too, not just the spectacular plays.
The How of Expected and Adjusted
Warning: For the number-phobic among you, we're about to hit the #gorymath section of the article. Feel free to skip ahead if you feel so inclined.
Now that we know why we are working with the expectation of wOBA, we actually need to calculate it. As we begin, it's important to consider (1) what is our data, (2) what is our response, and (3) how will we model it? Because there is limited publicly available "hit location" data, we must use the MLB Advanced Media data as downloaded from the Baseball Heat Maps website. I know this is not actually hit location data (it's fielding location), but it will have to suffice. Using the (x,y) locations, we will create our model matrix using a radial basis function spline with a Gaussian kernel, interacted with the hit type. Therefore, plays of a similar batted ball type that are closer together in location will have a stronger influence in the model than balls in play that are farther apart.
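A minimal sketch of such a basis expansion follows, assuming Gaussian radial basis functions centered at a set of knots and a one-hot interaction with hit type. Knot placement and bandwidth selection are modeling choices not shown here.

```python
import numpy as np

def rbf_basis(locs, knots, bandwidth):
    """Gaussian radial basis expansion of (x, y) locations.

    locs: (n, 2) fielded-ball locations; knots: (k, 2) basis centers.
    Nearby balls in play get large shared basis values, so they
    influence the fit more than distant ones.
    """
    d2 = ((locs[:, None, :] - knots[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def design_matrix(locs, hit_type, knots, bandwidth, n_types):
    # Interact the spline basis with hit type (ground ball, fly ball,
    # etc.) via a one-hot expansion: one block of columns per type.
    basis = rbf_basis(locs, knots, bandwidth)        # (n, k)
    onehot = np.eye(n_types)[hit_type]               # (n, n_types)
    return (onehot[:, :, None] * basis[:, None, :]).reshape(len(locs), -1)

# Tiny demo: 5 balls in play, 3 knots, 2 hit types -> 6 columns
demo_locs = np.array([[0., 0.], [1., 0.], [0., 1.], [2., 2.], [1., 1.]])
demo_knots = np.array([[0., 0.], [1., 1.], [2., 0.]])
demo_types = np.array([0, 1, 0, 1, 1])
X = design_matrix(demo_locs, demo_types, demo_knots, bandwidth=1.0, n_types=2)
```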
The response that we want to model is the result of each ball in play, which in this case is one of four levels: Out, Single, Double, and Triple. Mathematically, y ∈ {Out,Single,Double,Triple}. Further, the response is ordinal, in that it can be ordered naturally (Single is better than an out, double better than a single, and triple is better than a double). We will take advantage of this ordinal property in our modeling scheme described below.
The question then becomes: how do we model an ordinal response? There are generally two possible techniques, with the choice depending on your philosophy of statistics. The first, a multinomial logit/probit model, springs from classical statistics. This technique models the log odds of each category compared to a baseline, which here would result in three models: comparing the log odds of a single to an out, a double to an out, and a triple to an out. That means three models and three sets of coefficients. I'm not going to use this model, though, for two reasons. The first is that I don't want to work with three sets of coefficients. The second is that I tend to lean Bayesian as a statistician, so I prefer a Bayesian hierarchical model.
The Bayesian method of modeling an ordinal response is the data augmentation model described in Albert and Chib (1993), which introduces latent variables to model the observed response. It's worth a few sentences to describe how this works. The assumption is that there is some unseen, unmeasurable latent variable that determines the level of the observed ordinal response. If the latent variable z_{i} is less than some set benchmark (usually 0), the response y_{i} comes from Level 1. Once z_{i} is greater than 0 but less than a cutpoint (τ_{1}), the observed response flips to Level 2. This pattern continues until the highest level of the ordinal response is reached.
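The cut-off logic can be sketched in a few lines. The cutpoint values below are purely illustrative; in the real model they are estimated along with everything else.

```python
from bisect import bisect_right

# Sketch of the Albert-Chib (1993) data augmentation idea: a single
# continuous latent z_i, cut at 0 and at cutpoints tau_1, tau_2,
# determines which ordinal outcome is observed. Cutpoints here are
# made-up numbers for illustration.
LABELS = ["Out", "Single", "Double", "Triple"]

def latent_to_outcome(z, cutpoints=(0.0, 0.8, 1.5)):
    # z below 0 -> Out; between 0 and tau_1 -> Single; and so on up
    return LABELS[bisect_right(cutpoints, z)]
```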
The rest of the hierarchical model is entirely conjugate, making it possible to fit the model using a Gibbs Sampler, which will output a full distribution of parameters in the model, from which we can calculate the probability of an out, single, double, and triple on each ball in play. Finally, from these probabilities we will calculate the distribution of BXwOBA.
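As an illustration of that last step: with a probit-style latent variable of unit variance, the four category probabilities fall out of the normal CDF evaluated at the cutpoints, and the expected wOBA value of a ball in play is then a weighted sum. The cutpoints and linear weights below are illustrative assumptions, not values from the fitted model.

```python
from math import erf, sqrt

def phi(x):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def outcome_probs(mu, taus=(0.8, 1.5)):
    """P(Out, Single, Double, Triple) given latent mean mu.

    Assumes unit latent variance; cutpoints are illustrative (in the
    real model they are sampled within the Gibbs Sampler).
    """
    cuts = [0.0] + list(taus)
    cdfs = [phi(c - mu) for c in cuts] + [1.0]
    return [cdfs[0]] + [cdfs[i + 1] - cdfs[i] for i in range(len(cuts))]

def expected_woba_bip(mu, weights=(0.0, 0.89, 1.27, 1.62)):
    # Expected wOBA contribution of one ball in play
    return sum(p * w for p, w in zip(outcome_probs(mu), weights))
```

Averaging these per-play expectations over a player's balls in play, at each draw of the sampler, is what yields a full posterior distribution for BXwOBA rather than a single point estimate.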
Full hierarchical model for BXwOBA
That takes care of calculating the expectation of wOBA. However, how are we going to adjust for defenses? There are a few possible ways to go about that. The first involves seeing how a player does against the average defensive starting position league wide. However, that data does not exist in the public forum, so this is not a viable option. The second option is to just regress BXwOBA to the league average. However, I don't believe this to be a viable option because it unfairly punishes the players who cannot be so easily defended. So we're going to go with a third, more directed option.
What we'll do is augment the dataset with comparable players. We'll look through the dataset to find players with comparable hit distributions, draw individual plays from this subset, and add them to the dataset. The implicit assumption is that players with similar batted ball profiles will be defended similarly. If this is true, then a player's BXwOBA will not change much from his observed wOBA. However, if a player is being defended differently from his comparable players, then his BXwOBA will change from his observed wOBA.
So what do we want to look for in comparable players? We're trying to find a player with a similar hit location distribution, as well as a similar batted ball profile. For example, you wouldn't want one of Derek Jeter's comparables to be David Ortiz. For one, Ortiz is left-handed and Jeter is right-handed, and Ortiz is a heavy pull hitter. Better, instead, to have Yadier Molina as a Jeter comp. Both righty, both pretty much all-field hitters.
While we could go through each hitter and determine their comparability to each other visually, this would take a large amount of time, and we want a more objective method. So in order to calculate a weight when comparing two distributions, we're going to go back to a favorite of mine, the Kolmogorov-Smirnov (K-S) test.
The K-S test is a flexible test, as it is able to detect differences in both position and variability. It does so by finding the maximal absolute difference between the two empirical cumulative distribution functions of our samples. However, there is a slight problem. The K-S test is for univariate samples and we have bivariate (x,y) location data. So, we're going to move to a bivariate K-S type of statistic, where we define F[(x,y)]=P(X≤x and Y≤y). Essentially, instead of looking for the largest maximal difference between two lines, we'll be looking at two planes.
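Here is a small sketch of that bivariate K-S-type statistic, using the F[(x,y)] = P(X ≤ x and Y ≤ y) definition above. It evaluates the two empirical CDFs only at the observed points, and a fuller treatment might also scan the other quadrant orderings; treat this as an illustration of the idea rather than the article's exact criterion.

```python
import numpy as np

def bivariate_ks(a, b):
    """Bivariate K-S-type distance between two (n, 2) location samples.

    For each observed point, compute the fraction of each sample that
    is <= it in both coordinates (the empirical joint CDF), then take
    the maximal absolute difference between the two CDF surfaces.
    """
    pts = np.vstack([a, b])

    def ecdf(sample, pts):
        # Fraction of sample points <= (x, y) component-wise
        le = ((sample[None, :, 0] <= pts[:, None, 0])
              & (sample[None, :, 1] <= pts[:, None, 1]))
        return le.mean(axis=1)

    return np.abs(ecdf(a, pts) - ecdf(b, pts)).max()
```

Identical samples give a distance of 0, and completely separated samples approach 1, so small values mark good comparables.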
Once the weights for comparing players are calculated (from the K-S values), we'll use these weights to draw individual plays to augment our dataset. However, a reasonable question would be, "Could the choice of data possibly bias the results of BXwOBA?" The answer is yes, but our algorithm guards against that. We avoid this potential bias by selecting n^{*} plays at each iteration of the Gibbs Sampler, thus essentially integrating out the effect of data choice in the model.
So, at each iteration of our Gibbs sampler, we follow this procedure:
- Select n^{*} plays to augment the dataset according to the weights calculated using the K-S criterion.
- Draw from the full conditionals defined from the hierarchical model given above.
This procedure is iterated until convergence, which in this case is nearly immediate. Then, from the output of the model, we calculate BXwOBA by the same procedure as wOBA, under the assumption that all FIP components of wOBA are fixed.
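The first step of that loop, the weighted draw of augmentation plays, might look like the following sketch. The weights here are random placeholders standing in for the K-S-derived weights, and the draw would be repeated fresh at every Gibbs iteration.

```python
import numpy as np

rng = np.random.default_rng(42)
play_ids = np.arange(1000)         # candidate plays from comparable players
ks_weights = rng.random(1000)      # placeholder for K-S-derived weights
p = ks_weights / ks_weights.sum()  # normalize to selection probabilities

n_star = 50
# Drawing a fresh subset each iteration is what integrates out the
# effect of any one particular data choice
chosen = rng.choice(play_ids, size=n_star, replace=False, p=p)
```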
BXwOBA and Prediction: 2013 and 2014
Welcome back to those who skipped the last section!
Okay, all this is nice, but for BXwOBA to be useful, it has to be predictive. Preferably, it predicts future wOBA. To take a look at this, I calculated BXwOBA for all 2013 players who qualified for both the 2013 and 2014 batting titles (a total of 95 players). BXwOBA had similar means and standard deviations as wOBA (around .335 and 0.034, respectively, for qualified players), and the average absolute difference between BXwOBA and wOBA was 0.008.
I also would like to note that BXwOBA is not deterministic. That is, if the BXwOBA algorithm were run twice for the same player, it would very likely produce two slightly different values. However, the variability in the final estimate is negligible, as the 95% credible interval for it has a length of 0.001.
Finally, we come to the question of correlation with future wOBA. The correlation between 2013 wOBA and 2014 wOBA was found to be 0.5077. 2013 BXwOBA correlated with 2014 wOBA at a rate of 0.5461. This is a small increase in correlation, but this needs to be considered in context. First of all, I have not attempted to adjust the FIP components of wOBA in any way, and this adjustment should help to increase the correlation slightly. More importantly, the data we have worked with is not actually the data we want. Recall, the "hit location" data from MLBAM is not actually hit locations. It is in reality fielding locations. This will affect the accuracy of our model, and therefore the correlation. Given true hit location data, I believe it is reasonable that the correlations with future wOBA will increase by large amounts, up to 0.7 and possibly beyond.
Where Do We Go From Here?
This is a good first step in working with expected wOBA, but it is just a first step. In future work, I would like to try to get access to true hit location data rather than the fielding location data that is publicly available. This should definitely help the model in terms of accuracy. However, if this is not possible I would attempt to apply an errors-in-variables model to the data. This mindset essentially says, "I know that the ball was fielded here. But based on the parameters of the model, the other data we've seen, etc., it makes more sense that the ball actually landed 20 feet in front of the fielder." This model is more computationally intensive, and would require some care in setting the priors on the model.
An additional next step that is clear is to attempt to adjust the FIP components of wOBA to account for pitcher faced. This is necessary to make BXwOBA a more well-rounded, all-encompassing expected/adjusted wOBA statistic.
Summary
All statistics need context. This is well known in the sabermetric community, as there are several statistics that are adjusted to account for park, era, etc. However, opponents are rarely adjusted for, making the implicit assumption that this imbalance is washed out over the 162-game season. BXwOBA tries to account for this difference in opponent, specifically in opponent defenses.
It does so in two ways. The first way is to calculate the expected wOBA of a ball in play based on hit location and hit type. This is accomplished through a Bayesian hierarchical model as defined by Albert and Chib (1993). From this model we can calculate the distribution of the probability of outs, singles, doubles, and triples on each ball in play, as well as the distribution of BXwOBA. To adjust for opponent defense, the dataset is augmented with balls in play from players with comparable hit location distributions. These weights for augmenting the dataset are calculated through an author-defined bivariate K-S criterion.
In calculating the correlation of BXwOBA with future wOBA, it's seen that BXwOBA correlates with future wOBA at a slightly higher rate than previous wOBA itself. This slight increase is acceptable due to the nature of the data, which plausibly lowers the correlation. Future work includes attempts to account for this imperfect data, as well as incorporating the FIP components of wOBA to create a more complete BXwOBA statistic.
References
- Bayesian Analysis of Binary and Polychotomous Response Data, James H. Albert and Siddhartha Chib. Journal of the American Statistical Association, Vol. 88 (1993).
- Semiparametric Regression, David Ruppert, M.P. Wand, and R.J. Carroll. Cambridge University Press, 2003
- Elements of Statistical Learning, Trevor Hastie, Robert Tibshirani, and Jerome Friedman. Springer, 2009.
- Sulla determinazione empirica di una legge di distribuzione, A. Kolmogorov. Giornale dell'Istituto Italiano degli Attuari, 4:83–91 (1933).
- Sur la distribution de ω² (critérium de M. R. v. Mises), N. V. Smirnov. Comptes rendus de l'Académie des Sciences, 202:449–452 (1936).
. . .
I would like to be sure to thank a few people. First, my adviser Dr. Leanna House, who gave me a couple of weeks off to work on this research, as well as helping with a little code. Second, Michelle Gervasio, who provided me some computing time on her computer to help speed up the process. Finally, the Virginia Tech Statistics Department for giving me access to some of the good computers to calculate the actual BXwOBA statistics.
Stephen Loftus is a PhD candidate in Statistics at Virginia Tech. In his spare time he is an editor and writer at Beyond the Box Score. You can follow him on Twitter at @stephen__loftus.