Should the OBP formula include errors?

The current OBP formula doesn't give batters credit for reaching base on error. How does adding ROE affect how we use OBP?

Starling Marte leads the NL in ROE, with nine so far in 2014.
Brian Kersey

Reaching base on an error gets no respect. Batters get angry when scorers turn hits into errors, and no wonder. Plate a run on an error, and you might get robbed of an RBI. And, of course, reaching on an error lowers your batting average and your on-base percentage.

Yes, OBP too. The official formula for OBP has only hits, walks, and hit-by-pitches in the numerator:

OBP = (H + BB + HBP) / (AB + BB + HBP + SF)
This quirk is an old one, and is thus a favorite target for sabermetric writers who otherwise love OBP. The Grupo Independiente para la Investigación del Béisbol (GIIB), a Cuban group interested in applying sabermetric principles to the Serie Nacional, published an article last fall proposing a new formulation:

gOBP = (H + BB + HBP + ROE) / (AB + BB + HBP + SF + SH)
This new formula, which they referred to as gOBP, both credits the batter for reaching on errors and penalizes the batter for sacrifice bunts. They argue first, that any baserunner gives his team a chance to score, regardless of how he reached base; second, that the batter can influence whether a batted ball becomes an error*; and third, that if HBPs (which are basically mistakes by the pitcher) are counted as positive events in OBP, then errors (mistakes by the fielders) should be as well. To support these arguments, they show that team gOBP correlates better with runs per game (R/G) than the traditional team OBP.

* - This idea is intuitive, but previous research couldn't find any relationship between speed and ROE.
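To make the difference between the two statistics concrete, here is a minimal Python sketch. It assumes gOBP is built exactly as the description above suggests: ROE added to the numerator and sacrifice bunts (SH) added to the denominator. The function names and the season line are made up for illustration and do not belong to any real player.

```python
def obp(h, bb, hbp, ab, sf):
    """Official on-base percentage: times on base over qualifying PAs."""
    return (h + bb + hbp) / (ab + bb + hbp + sf)


def gobp(h, bb, hbp, roe, ab, sf, sh):
    """gOBP as described in the text (an assumption about the exact form):
    reaching on error counts as reaching base, and sacrifice bunts
    count as opportunities."""
    return (h + bb + hbp + roe) / (ab + bb + hbp + sf + sh)


# Hypothetical season line, for illustration only.
line = dict(h=150, bb=60, hbp=5, ab=550, sf=5)

print(round(obp(**line), 3))                 # 0.347
print(round(gobp(roe=8, sh=2, **line), 3))   # 0.359
```

Note that ROE needs no matching term in the denominator, because a plate appearance that ends in an error is already charged as an at-bat.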

In this article, we will extend this work to MLB, investigating whether adding ROE to OBP helps to predict team scoring, future performance for individual batters, and batter/pitcher matchups.

Predicting Runs Per Game

The conclusion of the GIIB article shows that team gOBP has a correlation coefficient of 0.95 with R/G, a slight but meaningful improvement over the correlation coefficient of 0.93 between team OBP and R/G. This first test is straightforward: using Retrosheet, I collected team R/G, OBP, and gOBP for all 1,482 team seasons dating back to 1955. I then fit a linear model to these data and computed the correlation coefficients for each pairing. The results are below.

R/G with...   r
OBP           0.899
gOBP          0.896

In fact, traditional OBP predicts team R/G very slightly better than the new OBP formulation. This matches the findings of James Click in a 2004 Baseball Prospectus article.
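For readers who want to run the same kind of comparison on their own league, the sketch below computes a Pearson correlation coefficient between a team rate statistic and R/G. It is not the code used for this article, and the five team seasons are invented placeholders rather than Retrosheet data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical team seasons: (OBP, gOBP, R/G). Not real data.
teams = [
    (0.310, 0.318, 3.9),
    (0.325, 0.334, 4.4),
    (0.335, 0.341, 4.6),
    (0.345, 0.356, 5.0),
    (0.320, 0.331, 4.1),
]

runs = [t[2] for t in teams]
r_obp = pearson_r([t[0] for t in teams], runs)
r_gobp = pearson_r([t[1] for t in teams], runs)
print(f"OBP r = {r_obp:.3f}, gOBP r = {r_gobp:.3f}")
```

Whichever statistic produces the larger r tracks team scoring more closely in that sample.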

Predicting Future OBP

But there are other, better methods for predicting team runs scored. Statistics like weighted on-base average (wOBA) and runs created assign different weights to different events (e.g., a home run and a walk), and correlate even better with team offense. What about individual performance? Maybe gOBP does a better job of predicting a batter's true talent level, and thus is less random from one year to the next.

To test this, I collected all batters in the Retrosheet database since 1975 who logged at least 300 plate appearances in two consecutive seasons. (A batter could, of course, appear multiple times.) This covered 5,607 batters, from Barry Bonds's 2002 (.582 OBP, .587 gOBP) to Mario Mendoza's 1979 (.216 OBP, .219 gOBP). As before, I fit a linear relationship between each statistic in year 1 and the same statistic in year 2, and determined the respective correlation coefficients.

Year 1 vs Year 2   r
Using OBP          0.615
Using gOBP         0.612

As was the case for R/G, the traditional OBP formula actually does slightly better at predicting next year's OBP than gOBP does for next year's gOBP.
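The sample construction is the fiddly part of this test, so here is a hedged sketch of how the year 1 / year 2 pairs could be assembled. The tuple layout, the function name, and the sample rows are assumptions for illustration; only the 300 PA threshold comes from the text.

```python
MIN_PA = 300  # threshold from the text


def consecutive_pairs(seasons, min_pa=MIN_PA):
    """Pair each qualifying season with the same batter's next season.

    seasons: iterable of (player_id, year, pa, stat) tuples.
    Returns a list of (stat_year1, stat_year2) tuples in which both
    seasons meet the PA minimum and the years are consecutive.
    """
    qualified = {
        (player, year): stat
        for player, year, pa, stat in seasons
        if pa >= min_pa
    }
    return [
        (stat, qualified[(player, year + 1)])
        for (player, year), stat in qualified.items()
        if (player, year + 1) in qualified
    ]


# Hypothetical batter seasons: (player_id, year, PA, OBP). Not real data.
sample = [
    ("a", 2001, 600, 0.350),
    ("a", 2002, 550, 0.360),
    ("a", 2003, 200, 0.340),  # below the PA minimum, so it breaks the chain
    ("a", 2004, 500, 0.355),
    ("b", 2001, 400, 0.300),
    ("b", 2002, 310, 0.310),
    ("b", 2003, 320, 0.305),
]

pairs = consecutive_pairs(sample)
print(len(pairs))  # 3
```

A batter with three straight qualifying seasons contributes two pairs, which is why the same name can show up more than once. The two columns of `pairs` can then be passed to any Pearson correlation routine.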

Predicting Batter/Pitcher Matchups

Okay, one last hypothesis. Even if it's not better at predicting team offense, and it's not better at predicting batter performance, maybe gOBP does a better job of predicting how a given hitter will fare against a given pitcher.

Recall that we can use the batter's OBP, the pitcher's OBP allowed, and the league OBP to find an expected OBP for a given matchup using the odds ratio. Since gOBP is still a proportion, we can use it to perform the same analysis. To determine which is more accurate, we first group the batters and pitchers into bins five points (.005) wide. We then find an expected OBP for each pairing of a batter bin and a pitcher bin, and compare this to the actual results of the matchups in that pairing. As an example, consider the first pair in our database: David Aardsma and Bobby Abreu, who faced each other once in 2010.

                Batter        Pitcher
Name            Bobby Abreu   David Aardsma
2010 gOBP       .361          .297
Bin             .360          .295
Expected gOBP   .324
Actual Result   1.000
Bin gOBP        .374
Squared Error   .0025
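The steps in this example can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code behind the table: the rounding-down bin rule is inferred from the example values (.361 to .360, .297 to .295), and `LEAGUE_GOBP` is a hypothetical placeholder, so the printed expectation need not match the table's .324.

```python
LEAGUE_GOBP = 0.330  # hypothetical league rate; the article does not state it


def odds(p):
    """Convert a proportion to odds."""
    return p / (1 - p)


def expected_matchup(batter, pitcher, league):
    """Odds-ratio estimate of the on-base rate for a batter/pitcher matchup."""
    o = odds(batter) * odds(pitcher) / odds(league)
    return o / (1 + o)


def to_bin(value, width_points=5):
    """Round a rate down to the lower edge of its bin (width in points)."""
    points = round(value * 1000)
    return (points - points % width_points) / 1000


def squared_error(expected, actual):
    """Squared difference between the expected and observed bin rates."""
    return (actual - expected) ** 2


batter_bin = to_bin(0.361)   # 0.36
pitcher_bin = to_bin(0.297)  # 0.295
print(expected_matchup(batter_bin, pitcher_bin, LEAGUE_GOBP))

# Using the table's own figures: expected .324 against a bin result of .374.
print(round(squared_error(0.324, 0.374), 4))  # 0.0025
```

The squared error is taken against the bin's overall result (.374), not the single plate appearance (1.000), which is what keeps one-PA matchups like this one from dominating the comparison.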

To compare the performance of OBP and gOBP, we compute a weighted mean squared error (WMSE) to quantify the difference between the expected and actual statistic for the set of batter/pitcher matchups in each bin. The number of plate appearances is used to weight each bin, so that more common pairings affect the final score more than rarer ones. We collected all matchups between batters and pitchers with at least 50 total PAs in a season from 2010 through 2013, a sample of more than 750,000 PAs. The WMSE (along with the unweighted mean squared error) for both OBP and gOBP is given below.

2010 0.0031 0.0013 0.0030 0.0014
2011 0.0035 0.0015 0.0033 0.0016
2012 0.0032 0.0014 0.0033 0.0015
2013 0.0034 0.0016 0.0032 0.0016

Once again, the traditional OBP slightly outperforms gOBP in this application.
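For reference, the two error measures differ only in how each bin is weighted. The sketch below shows one way to compute them; the dictionary keys and the two sample bins are hypothetical and are not drawn from the 2010-2013 data.

```python
def mse(bins):
    """Unweighted mean squared error: every bin counts equally."""
    return sum((b["actual"] - b["expected"]) ** 2 for b in bins) / len(bins)


def wmse(bins):
    """PA-weighted mean squared error: busy bins count for more."""
    total_pa = sum(b["pa"] for b in bins)
    weighted = sum(b["pa"] * (b["actual"] - b["expected"]) ** 2 for b in bins)
    return weighted / total_pa


# Hypothetical bins: one rare pairing that misses badly, one common
# pairing that is nearly on target.
bins = [
    {"expected": 0.324, "actual": 0.374, "pa": 10},
    {"expected": 0.330, "actual": 0.335, "pa": 990},
]

print(mse(bins))
print(wmse(bins))
```

In this made-up case the rare bin inflates the unweighted figure, while the weighted figure mostly reflects the well-populated bin.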


So, despite the logical arguments in its favor, including errors and sacrifice bunts in OBP does not improve the predictive power of the statistic, and actually makes it slightly worse, when applied to MLB data. What, then, to make of the GIIB article? What makes Cuban baseball so different from its American counterpart?

For one thing, there are a lot more errors per game in Cuba than in the States. Consider this table, which compares the number of errors per game in the 2013 MLB regular season with the first two stages of the 53rd Serie Nacional (2013-14).

League      G      E      Fld %   E/G
SN (Cuba)   1054   1143   0.973   1.084
MLB         4862   2747   0.985   0.565

That's a difference of about one error every two games. This may seem insignificant, but we can use Tom Tango's run environment generation program to see what kind of effect those extra errors would have on offense. Plug in the 2013 MLB batting statistics (counting HBP as BB and ROE as hits) and the program estimates a run environment of 4.8 R/G*. But double the number of errors, and that number jumps by half a run to 5.3 R/G.

* - This is, of course, much greater than the actual 2013 run environment, where 4.2 runs were scored per game. But the program assumes no extra outs are made on the basepaths; in real games, those outs limit the total number of runs scored.
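The per-game error rates in the table above follow directly from its G and E columns. A quick check in Python, using only those published figures:

```python
# (games, errors) for each league, copied from the table above.
leagues = {
    "SN (Cuba)": (1054, 1143),
    "MLB": (4862, 2747),
}

rates = {name: errors / games for name, (games, errors) in leagues.items()}
gap = rates["SN (Cuba)"] - rates["MLB"]

for name, rate in rates.items():
    print(f"{name}: {rate:.3f} E/G")
print(f"difference: {gap:.3f} E/G")  # about 0.519, or one error every two games
```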

So even if gOBP doesn't help predict the performance of Major Leaguers, those working in leagues where the fielding is not so sure -- including high schools, colleges, and minor leagues -- should consider including errors in their on-base percentage formulas.

. . .

All MLB statistics courtesy of Retrosheet and Baseball-Reference. Serie Nacional statistics courtesy of the National Institute of Sports, Physical Education, and Recreation (INDER).

Bryan Cole is a featured writer for Beyond the Box Score who's fielded his share of angry emails requesting that errors be turned to hits and earned runs turned to unearned. You can follow him on Twitter at @Doctor_Bryan.