Making small sample defensive metrics less volatile

How much value is there in a single season of fielding data? More than we're currently extracting. The trick is to let data inform data by using Bayesian methods.

Executive Summary

Current fielding metrics have annual volatility approximately as great as BABIP. This creates difficulties in establishing a timely evaluation of a player’s defensive ability.

This article describes a process that can be used to reduce the volatility of defensive metrics. It finds that a modification to the current calculation of DRS, using Bayesian inference, makes a proxy defensive metric about as "reliable" as offensive metrics (e.g., wRC+ and ISO). This process is demonstrated with publicly available Inside Edge fielding data. The crux of this technique is a two-step process: (1) establish an initial estimate of fielding proficiency for a play difficulty category (e.g., Likely) based on out-of-category data (e.g., Routine, About Even, etc.) and then (2) correct this estimate with in-category fielding data. The advantage of this technique is that it makes a single season of defensive data more representative of a player's true talent level.

Additionally, the article discusses how the re-development of defensive metrics, using a recursive Kalman filter, might reduce their volatility to an even greater extent.

---

You know the phrase: defense wins championships. That cliché is owed to Paul "Bear" Bryant – he also noted that offense sells tickets – and is often used to describe why football or basketball teams have succeeded. Baseball fans are more accustomed to "good pitching beats good hitting," or the like. Since the mid-2000s, though, defense has been receiving significant attention. The Royals’ World Series run alerted many outside the sabermetrics community to this aspect of the game.

Of course, any discussion of defense can't go too far without discussing some of the limitations of defensive metrics. This isn’t to deride defensive metrics; they are capturing more and more of the picture. Currently, volatility is their greatest limitation.

DRS and UZR tell the right story when you have the luxury of looking at multiple years of data. On the other hand, single season numbers can be misleading due to their high uncertainty. Sabermetric minds have traditionally combated this in one of two ways, either (1) waiting for more data (which isn't really a solution) or (2) regressing back to league average. Unfortunately, the latter has two main problems. First, regression values are often arbitrary. Second, a standard regression can mask the very trends we wish to illuminate.

There’s another option. Specifically, we can better employ the data we have. We need to focus on the performance of only the player in question and only in recent timeframes (i.e., within the past season). Before you all shout "small sample size"—that’s the whole point. We need metrics that converge faster. I don’t think we need all-new metrics—I have no appetite for that. So how can this be done? With a Bayesian filter.

This seems like the classic problem for a Kalman filter (or one of its close cousins). A Kalman filter estimates the state of a variable that cannot be measured directly, and it reduces the noise in that estimate by recursively refining it as each new measurement arrives. Defensive capability fits the category of "not directly observable" quite well. I’m not going to develop a defensive metric Kalman filter in this piece, as the available data isn’t in an easily adaptable "stream" format; I’d need game-log-style date stamps, park, and other control metadata. Besides, that’s beyond the current scope. Instead, I’ll show the value of a Bayesian-inference-based data "preprocessing" approach that could be applied to any system based on play-category rates (e.g., DRS and UZR). This is a smallish step to demonstrate that "there’s a pony buried somewhere in here."
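For readers who haven't met the recursion a Kalman filter performs, here is a generic one-dimensional textbook sketch. It is not a defensive metric filter and is not used for any result in this article; the function name `kalman_update`, its parameters, and the sample numbers are all illustrative assumptions.

```python
def kalman_update(estimate, variance, measurement, meas_variance,
                  process_variance=0.0):
    """One predict/correct step of a scalar Kalman filter.

    The hidden state is assumed to be (nearly) constant between steps,
    so the "predict" step only inflates the uncertainty.
    """
    # Predict: the state may have drifted, so uncertainty grows.
    variance = variance + process_variance
    # Correct: weight the new measurement by its relative reliability.
    gain = variance / (variance + meas_variance)
    estimate = estimate + gain * (measurement - estimate)
    variance = (1.0 - gain) * variance
    return estimate, variance


# Hypothetical usage: start from a vague prior and fold in noisy
# measurements one at a time (all numbers are made up).
estimate, variance = 0.0, 1.0
for noisy_value in [0.8, 1.1, 0.9]:
    estimate, variance = kalman_update(estimate, variance, noisy_value, 1.0)
```

Each pass shrinks the variance of the estimate, which is the "recursive refinement" described above.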

Making a Bayesian inference involves establishing an estimate of an unknown probability and modifying this estimate based on relevant data to arrive at a posterior probability (to paraphrase Peter Lee, Bayesian Statistics, section 2.1).

For a data source, I used Inside Edge's fielding data found on FanGraphs, since it's the most detailed data set widely available. Frankly, this isn't so much an endorsement as an operational constraint. Keep in mind the objective: to reduce the seasonal volatility of defensive metrics. Recall that Inside Edge classifies plays as follows:

Impossible (0%)
Remote (1-10%)
Unlikely (10-40%)
About Even (40-60%)
Likely (60-90%)
Almost Certain / Certain (90-100%)

Since the "Impossible (0%)" category contains, by definition, no successful plays for any player, I won’t consider that category any further. Five play categories remain. Here's an overview of the process that I'm using.

Say I'm interested in the "unlikely (10-40%)" play category. If the player is above average in the "remote (1-10%)", "about even (40-60%)", "likely (60-90%)" and "almost certain (90-100%)" (i.e., the four "other" categories), he’s probably above average in the "unlikely (10-40%)" category also. I've established an initial state estimate as one of three possibilities, "good," "bad," or "average", for each play category by considering the other four play categories. Then, the in-category fielding data is used to correct the initial state estimates. The key is that the initial state estimates and the state refinements use different data. If some of that was difficult to follow, don't worry, I'll go step-by-step in the detail to follow.

That’s a lot of information in one graph, I know, but it’s relatively easy to unpack. To start, consider play success rate as a binomial parameter with a category-specific nominal success rate. Assuming that the player in question converted plays at the nominal success rate, the below average state corresponds to a number of successes whose cumulative probability (for a known sample size) is less than 0.3. The above average state corresponds to a number of successes whose cumulative probability is greater than 0.7, and the average state is everything in between. Stated more simply, performance in a colored zone corresponds to the "average" state. Performance above (or below) the colored zone corresponds to above average (or below average).
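The three-state classification can be sketched in a few lines. This is a minimal sketch, assuming that "probability" here means the cumulative binomial probability P(X <= successes) at the nominal rate; the handling of ties at the boundaries, the zero-opportunity case, and the function names are my assumptions rather than the exact rules behind the graph.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def classify_state(successes, opportunities, nominal_rate,
                   low=0.3, high=0.7):
    """Return -1 (below average), 0 (average) or +1 (above average)."""
    if opportunities == 0:
        return 0  # assumption: no chances means no evidence either way
    cdf = binom_cdf(successes, opportunities, nominal_rate)
    if cdf < low:
        return -1
    if cdf > high:
        return 1
    return 0


# Hypothetical example: 100 chances in a category with a 50% nominal rate.
for made in (40, 50, 60):
    print(made, classify_state(made, 100, 0.5))
```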

For example, the topmost color band, green, represents the number of successes against the number of failures in the "Almost Certain/Certain" category. As you would expect, fielding successes far outnumber fielding failures. As you progress down the graph through purple, orange, yellow, and blue, the failures become more frequent in relation to successes. I chose third base for the example above, but this method can apply to any position. The next figure will look at Mike Moustakas' 2013 season for reference.

The state "value" column in the next figure’s table assigns a -1 (below average), 0 (average), or +1 (above average) to each of the five Inside Edge categories. The first goal is to establish an initial state estimate based not on how well a player performed in this category, but on how the player performed in all the other categories. If the other states sum to a value greater than 1, the initial state is determined to be above average; if they sum to less than -1, below average; and average otherwise.
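As a small illustration of this rule, the sketch below sums the other four state values and applies the thresholds. It assumes the thresholds are symmetric (a sum greater than +1 means above average, a sum less than -1 means below average); the category keys and the sample state values are hypothetical, not Moustakas' actual numbers.

```python
def initial_state(states, category):
    """Initial estimate for `category` from the OTHER categories' states.

    `states` maps category name -> -1, 0 or +1.
    """
    others = sum(value for name, value in states.items() if name != category)
    if others > 1:
        return 1   # above average
    if others < -1:
        return -1  # below average
    return 0       # average


# Made-up state values for one player-season.
states = {
    "remote": 1,
    "unlikely": 0,
    "about_even": 1,
    "likely": 0,
    "almost_certain": -1,
}
initial = {cat: initial_state(states, cat) for cat in states}
```

Note that a category's own state never feeds its initial estimate, which keeps the initial estimate and the later correction on separate data.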

To perform the Bayesian inference (what I’ve coined "self-regression"), an assumption about the confidence in the initial state is needed. A standard error could be used; however, I preferred to simulate a measurement by assuming a league average number of opportunities. The simulated measurement will be a number of successes and failures corresponding to the category-specific 30th, 50th, or 70th percentile conversion rates.

Next, I combined the actual play data and the simulated measurement using Bayesian methods to establish the best estimate of how well Mike Moustakas would have performed in each category of play given an extra bunch of innings.
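One simple way to carry out this combination is a conjugate Beta-binomial update, in which the simulated successes and failures act as prior pseudo-counts. That choice is an assumption made for illustration; the exact update used for the figures may differ. The names `binom_quantile`, `simulated_measurement`, and `self_regressed_rate`, and every number in the example, are hypothetical.

```python
from math import comb

# Initial state (-1, 0, +1) -> percentile used for the simulated measurement.
PERCENTILE_FOR_STATE = {-1: 0.30, 0: 0.50, 1: 0.70}


def binom_quantile(q, n, p):
    """Smallest k with P(X <= k) >= q for X ~ Binomial(n, p)."""
    cdf = 0.0
    for k in range(n + 1):
        cdf += comb(n, k) * p**k * (1 - p) ** (n - k)
        if cdf >= q:
            return k
    return n


def simulated_measurement(state, league_avg_opps, nominal_rate):
    """Pseudo-observation (successes, failures) implied by the initial state."""
    successes = binom_quantile(PERCENTILE_FOR_STATE[state],
                               league_avg_opps, nominal_rate)
    return successes, league_avg_opps - successes


def self_regressed_rate(successes, failures, sim_successes, sim_failures):
    """Pooled success rate: actual plays plus the simulated measurement."""
    total = successes + failures + sim_successes + sim_failures
    return (successes + sim_successes) / total


# Made-up example: 8 of 10 converted, initial state "average",
# 10 league-average opportunities at a 50% nominal rate.
sim_s, sim_f = simulated_measurement(0, 10, 0.5)
rate = self_regressed_rate(8, 2, sim_s, sim_f)
```

The pooled rate always lands between the raw rate and the rate implied by the initial state, and the more actual chances a player has, the less the simulated measurement matters.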

How much does this self-regression affect Mike Moustakas’ 2013 season? Well, he’s probably not as good at "about even" or "likely" plays as he performed. He’s probably a little better at "remote", "unlikely" and "almost certain" plays than he performed. So that’s a little worse here, a little better there—this case is going to be a push. This will be shown graphically in the first treemap (three figures down).

Fielding chances categorized by difficulty can be combined with number of successes (or success rate times number of opportunities) to evaluate defensive value; the Fielding Bible does this with DRS. Unfortunately, the Baseball Info Solutions data driving DRS is not publicly available, so I can’t illustrate this benefit on DRS directly. Inside Edge fielding data can be used to derive a DRS-like "toy" metric, which I’ve been calling "IERP" (Inside Edge Run Prevention). This hybrid Inside Edge data/DRS methodology has some flaws that I won’t elaborate on here. IERP is not better than DRS, since DRS has many valuable improvements beyond the basic plus/minus system. However, it will serve my purpose because it can be used to demonstrate a reduction in year-to-year volatility. It probably goes without saying, but a reduction in year-to-year volatility can help us understand defensive talent better by cutting through the noise faster.

IERP correlates well with DRS, as shown in the graphic below. The sample includes all the position players who logged 300 or more innings in each season from 2012-2014 (at least 900 innings) at a single position. For example, Mike Trout is included in the sample as a center fielder. At that position, he played 885.2, 952.2, and 1314 innings in the years 2012-2014; his 759 and 12 innings in LF and RF are excluded. A total of 164 players meet this criterion, totaling just shy of half a million innings and averaging 950 innings per player-season. I had to have a relatively low innings per season limit so that I had many players with performances in all three seasons; this was required to have a sensible measurement of volatility (more on that to come).

DRS has units of "runs saved", whereas IERP has units of something like "net number of outs". To sidestep the different units, I remapped both DRS and IERP to standard deviations above the mean (which is unitless). Because these parameters are measuring the same thing, it makes sense to harmonize the units (this isn't all that necessary because R² is independent of units). The 3-year correlations of IERP to DRS and self-regressed IERP to DRS are essentially the same quality. BIS’s data set used by DRS is more granular than Inside Edge data, so the variance is due to Inside Edge's relative coarseness and other corrections in DRS (e.g., ball-hogging). For reference, R² between DRS and UZR is greater than 0.8.
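The remapping to standard deviations above the mean, and the claim that R² does not care about units, can be checked with a short sketch. The DRS and IERP values below are made up for illustration, and the helper names are hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Remap values to standard deviations above the mean (unitless)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length samples."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy**2 / (sxx * syy)


# Made-up numbers: DRS in runs saved, IERP in net outs.
drs = [12, -5, 3, 0, -9, 7]
ierp = [8.5, -2.0, 1.5, 1.0, -6.5, 3.0]

raw_r2 = r_squared(drs, ierp)
std_r2 = r_squared(standardize(drs), standardize(ierp))
# raw_r2 and std_r2 agree up to floating-point rounding.
```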

For those who love equations, here is the conceptual process for calculating IERP. The following figures are an attempted re-computation of the Fielding Bible’s plus/minus system with Inside Edge fielding data. In a nutshell, IERP is a sum of play credits for successfully converted plays and penalties for booted plays.

In this equation, play credit is the complement of the nominal conversion rate, and play penalty is the nominal success rate. Fielding Bible’s article explains "credit" well. It is 1.0 (which is the occurrence of a play) minus the nominal conversion rate (which is the expectation that the player makes the play). "Penalty" is similarly defined.
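Written out, the credit and penalty definitions above suggest something like the following. This is a reconstruction from the prose rather than the original figure; the symbols $S_c$, $F_c$, and $p_c$ (successes, failures, and nominal conversion rate in difficulty category $c$) are notation introduced here for illustration.

```latex
\mathrm{IERP} \;=\; \sum_{c} \Big[\, S_c \,(1 - p_c) \;-\; F_c \, p_c \,\Big]
```

Because $S_c + F_c = N_c$ (the number of opportunities in category $c$), each term reduces to $S_c - N_c\,p_c$: plays made minus plays expected.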

The next figure is a treemap (a visual aid I first encountered when mapping which directories on a hard drive consume the most storage). Treemaps use rectangles to show hierarchical relationships that might otherwise be difficult to appreciate. Here, the left rectangles for both actual performance and self-regressed performance show "success credit" for Moustakas in 2013. Within each rectangle, you'll note subdivisions based on Inside Edge play difficulty, so it’s easy to see which difficulty contributes the most to his success credit. It’s the same for unconverted opportunities on the right. Here, the green and purple regions dominate, which means that these categories drive IERP. The lesson? The handling of routine plays should not be overlooked.

The next figures are Moustakas’ 2012 season.

Now that the contributors to IERP have been visualized, I can finally explain why I performed this complicated self-regression. The performance sum includes the product of small credits with large numbers of occurrences and large credits with small numbers of occurrences. The regression attempts to mitigate this uncertainty-prone calculation by substituting number of occurrences with the product of success rate and number of play opportunities.
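The substitution can be illustrated with a toy calculation. This is a sketch under stated assumptions: the nominal rates, play counts, and self-regressed rates are all made up, and the function names are hypothetical.

```python
def ierp_from_counts(plays, nominal_rates):
    """Raw IERP: credit (1 - p) per success, penalty p per failure.

    `plays` maps category -> (successes, failures).
    """
    return sum(s * (1 - nominal_rates[c]) - f * nominal_rates[c]
               for c, (s, f) in plays.items())


def ierp_from_rates(rates, opportunities, nominal_rates):
    """Same sum, with successes replaced by success rate * opportunities."""
    total = 0.0
    for c, n in opportunities.items():
        p = nominal_rates[c]
        est_successes = rates[c] * n
        est_failures = n - est_successes
        total += est_successes * (1 - p) - est_failures * p
    return total


# Made-up inputs for two categories.
nominal = {"likely": 0.75, "almost_certain": 0.95}
plays = {"likely": (34, 6), "almost_certain": (188, 12)}
opps = {c: s + f for c, (s, f) in plays.items()}

raw = ierp_from_counts(plays, nominal)
# Hypothetical self-regressed success rates for the same player.
regressed = ierp_from_rates({"likely": 0.80, "almost_certain": 0.945},
                            opps, nominal)
```

Feeding the raw success rates into `ierp_from_rates` reproduces the raw total, so the only thing that changes under self-regression is the rate itself.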

To evaluate the merit of this process, I'll consider volatility computed on a per inning basis and relative to a 3-year baseline (this is both the total extent of the available Inside Edge database and a reasonable amount of time for DRS/Inn to stabilize).

I'm going to define FV_DRS (fielding volatility) as the RMS (root mean squared) of (DRS/inn)_{each year} – (DRS/inn)_baseline. Think of this as the typical difference between this year’s DRS and a 3-year DRS, just on a per inning basis. FV_IERP replaces DRS with IERP. Here’s the gruesome detail:
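In code, the definition above might look like the sketch below. It assumes the baseline is the multi-year total divided by total innings (a mean of the yearly rates would be another reasonable reading), and it omits any further normalization. The function name and sample numbers are hypothetical.

```python
from math import sqrt


def fielding_volatility(metric_by_year, innings_by_year):
    """RMS of (metric/inn) in each year minus the multi-year (metric/inn).

    Works for DRS or IERP; both lists hold one entry per season.
    """
    baseline = sum(metric_by_year) / sum(innings_by_year)
    diffs = [m / inn - baseline
             for m, inn in zip(metric_by_year, innings_by_year)]
    return sqrt(sum(d * d for d in diffs) / len(diffs))


# Made-up three-season example.
fv = fielding_volatility([10, 0, -10], [1000, 1000, 1000])
```

A player whose per-inning value is identical every season has an FV of zero; the larger the swings around the three-year baseline, the larger the FV.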

Next, I'm going to compare the fielding volatilities: FV_DRS, FV_IERP, and self-regressed FV_IERP. Since FV is actually an error term, FV is shown as a cumulative probability distribution with a gamma distribution fit curve. In plain English, if the curve moves to the left, there is less fluctuation in year-to-year values.
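For the curious, here is a rough sketch of how an empirical cumulative distribution and a gamma fit could be produced from a list of FV values. Method-of-moments fitting is swapped in for simplicity; the fitting method behind the figures is not specified here, and the FV values below are made up.

```python
from statistics import mean, pvariance


def empirical_cdf(values):
    """Return (value, cumulative fraction) pairs, sorted by value."""
    ordered = sorted(values)
    n = len(ordered)
    return [(v, (i + 1) / n) for i, v in enumerate(ordered)]


def gamma_method_of_moments(values):
    """Gamma (shape, scale) matching the sample mean and variance."""
    m = mean(values)
    var = pvariance(values)
    shape = m * m / var
    scale = var / m
    return shape, scale


# Made-up FV values for five players.
fv_values = [1.0, 2.0, 3.0, 4.0, 5.0]
cdf_points = empirical_cdf(fv_values)
shape, scale = gamma_method_of_moments(fv_values)
```

A smaller shape-times-scale product (the mean of the fitted gamma) corresponds to a curve that sits further left, meaning less year-to-year fluctuation.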

Two separate comparisons need to be made. First, FV_DRS and FV_IERP have approximately the same shape and values. This shows that IERP is a reasonable proxy for DRS in establishing a volatility baseline. Secondly, self-regressed FV_IERP has lesser magnitude than does "raw" FV_IERP. Check out the gamma fits in the bottom half of the next figure for a visualization; the probability density function has a higher peak at a lower FV. In all cases, FV is computed with three years of fielding performance.

The next figure is a demonstration of capability, showing how self-regressed IERP seasons "group" tighter than do non-regressed IERP seasons. This technique tends to push extreme seasons back to the center of the grouping, while allowing that some seasons are a break from career norms. This is important because a break from career norms could be "a real thing" (e.g., playing through an injury, change in ballpark, precipitous decline etc.).

Succinctly, self-regression as pre-processing to defensive metrics makes more complete use of the available data without overusing any of it. This technique tends to make single season per inning metrics more closely reflect their three-year averages without materially reducing the three-year correlation to non-regressed metrics. In turn, smaller sample sizes are required to establish player talent level.

I’d like to see this formalized into a Bayesian filter (I recommend a Kalman filter) that utilizes the more granular Baseball Info Solutions database. To give you a feel for how much benefit there would be in this development, I’ve speculated on the volatility improvement in the next figure. Yes, I used the word "speculate." I could have called that an "informed expert opinion" or a "design objective," but those are just grownup words for "speculation." I don’t mean to sandbag my own work or beliefs, but I also wish to provide nuance in my confidence.

You’ll be forgiven if you fail to see the significance of that DRS volatility projection (shown in grey); normalized RMS uncertainty probability density functions aren’t the most intuitive data presentation. To provide context, I’ll reintegrate those curves into cumulative probability functions and show the volatility of batting stats that we’re accustomed to projecting forward and utilizing from year to year. The batting statistics are taken from the same population of 164 position players with at least 300 fielding innings in each of the 2012-2014 seasons. There are some small sample sizes in there, but there's nothing less than 100 PA in a single season. Most players are full-time players (500+ PA). Volatility is calculated for batting statistics just like for DRS and IERP: it’s the RMS of the Gaussian-normalized yearly difference.

What do I see? Well, raw IERP (and also raw DRS, not shown) has the same volatility as BABIP. Annual BABIP is mostly a crapshoot, so that confirms my expectation. ISO, wRC+ (and also wOBA, SLG, OBP, though not shown) are a fair bit better than BABIP. I don’t know about you, but when I see a one-year jump in ISO or wRC+, that’s a cause for interest. A one year blip may be a "breakout", or the change might not last. We need to look deeper. Self-regression of IERP achieves that level of volatility; it makes the stat about as "reliable" as ISO or wRC+, an improvement I’m pleased with. Further up on the chart, the projected improvement in DRS with the Kalman filter techniques I’ve discussed (but not implemented) could produce volatility similar to plate discipline (K/BB). Plate discipline is in the family of the most reliable statistics on an annual basis. Strikeout rate is better still, but I chose to leave that off.

Is this level of improvement possible? Perhaps not, but based on my experience with this class of algorithms, I believe it is. Either way, the burden of proof rests with the researcher.

. . .

All data courtesy of FanGraphs

Big thanks to Garrett Hooe. Garrett volunteered to provide additional proofreading and editing of this writing; he sifted through many iterations of this piece to make it readable. Tremendously appreciated! Garrett also publishes at www.federalbaseball.com.

Jonathan Luman is a system engineer with a background in aerospace. You can contact him at jonathan.r.luman@gmail.com.