
A new xBABIP


I've been looking for a good xBABIP calculator. The best I've found so far is slash12's, which was posted on this site. I decided to test it and build my own xBABIP. The tests use data from 2002-2011.


My formula is

xBABIP = 0.025278 + 0.238105957967707*GB% + 0.257892*FB% + 0.365655*LD% - 0.37291*IFFB% + 0.372909*IFH% + 0.04929*HR/FB
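For anyone who would rather compute this outside a spreadsheet, here is a minimal Python sketch of the formula. The coefficients are copied from the post; the function name, the argument names, and the choice to pass each rate as a fraction (0.44 rather than 44) are assumptions, and the sample hitter is made up.

```python
# Coefficients are copied from the formula in the post.
# Assumption: every rate is passed as a fraction (0.44, not 44).
def xbabip(gb, fb, ld, iffb, ifh, hr_fb):
    """Estimate xBABIP from batted-ball rates given as fractions."""
    return (0.025278
            + 0.238105957967707 * gb
            + 0.257892 * fb
            + 0.365655 * ld
            - 0.37291 * iffb
            + 0.372909 * ifh
            + 0.04929 * hr_fb)

# Hypothetical hitter, not real data.
estimate = xbabip(gb=0.44, fb=0.36, ld=0.20, iffb=0.10, ifh=0.06, hr_fb=0.10)
print(round(estimate, 3))
```

If your source reports percentages as whole numbers, divide them by 100 first.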

The formula was found using regression. I ran some tests to find out which was better, mine or slash12's.

So here are the results against future (next-season) performance:

Mine: correlation .388, RMSE .0288, MAE .022
slash12's: correlation .352, RMSE .0343, MAE .027

Against the same season:

Mine: correlation .57, RMSE .026, MAE .021
slash12's: correlation .49, RMSE .027, MAE .021
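For readers who want to run the same kind of comparison, here is a rough sketch of the three measures used above (correlation, RMSE, MAE) in plain Python. The function names and the sample numbers are made up for illustration; this is not the spreadsheet used for the post.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def rmse(predicted, actual):
    """Root mean squared error; punishes big misses more than MAE."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

def mae(predicted, actual):
    """Mean absolute error; the average size of a miss."""
    misses = [abs(p - a) for p, a in zip(predicted, actual)]
    return sum(misses) / len(misses)

# Made-up xBABIP predictions and actual BABIPs, not real players.
predicted = [0.300, 0.310, 0.290, 0.320]
actual = [0.310, 0.300, 0.280, 0.340]
print(pearson(predicted, actual), rmse(predicted, actual), mae(predicted, actual))
```

Higher correlation is better, while lower RMSE and MAE are better.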

So mine matches or beats slash12's in every test, for both the same season and the next season. I made a Google Doc of my xBABIP calculator. You can get it here. In the calculator, only enter numbers in the green boxes.

EDIT: I made the calculator even better with some help from MjwW. The correlation to next year is now better, at .44. The link and the formula in the post have been updated. The other tests probably improved too, but I can't rerun them: I messed up the spreadsheet I did them in, and I couldn't redo slash12's because I didn't include the Fangraphs batted-ball percentages.


Comments


If I understand this correctly...

The coefficients for LD% and IFFB% are essentially the same? This doesn’t seem right at all, since we know the observed BABIP on LD is something above .700 and close to .000 on IFFB. Or are IFFB also counted in FB, in which case it’s a double penalty (the order of the coefficient magnitudes would then make sense)?

Also, with the intercept being so large and all batted ball type coefficients being negative… intuitively this is a little hard to get your head around in terms of understanding the model at a practical level.

Finally, since this was done by regression, can you provide the standard errors for the coefficients (or alternatively the t stat or p value, whatever’s easiest)? Intuitively, it makes some sense that HR/FB would be positively correlated with xBABIP (better contact), but I’d be curious as to how strong it is.

Thanks!

by MjwW on Jan 17, 2012 2:18 PM EST reply actions  

I was curious about the LD% and IFFB% coefficients too.

It might be because LDs have almost no correlation from year to year, and the data was regressed against next year BABIP. I’m not sure if IFFBs have a correlation from year to year, but looking at the results of this regression, I’m guessing they don’t. If I regressed against same season BABIP, I would get much different results. It also might be because the league average LD% is much higher than the league average IFFB%.

I know, it seemed very strange. I’m very surprised it did so well in tests.

Standard errors are
LD% 1.325233629
GB% 1.325356677
FB% 1.325166464
IFFB% 0.016959096
HR/FB 0.01236696
IFH% 0.026090601
T-stats are
LD% -1.337897222
GB% -1.38497345
FB% -1.461443874
IFFB% -10.13107194
HR/FB 5.706834728
IFH% 5.711062212

by Bososx13 on Jan 17, 2012 2:32 PM EST up reply actions  

You know what it is...

I figured out why you have such a funny intercept and negative coefficients – it actually should have jumped immediately to my mind.

Because you’re including all the possible outcomes of a batted ball – LD, GB, FB (assuming you are using Fangraphs data, IFFB are a subset of FB) – the intercept term cannot properly calculate the average residual (which is what it’s supposed to do). I don’t know your stats background, but it creates a “vector of ones” and perfect multicollinearity, which wrecks the results.
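A small sketch of why this breaks the regression, assuming LD%, GB% and FB% sum to 1 for every player as described above. All numbers and names here are hypothetical. Because the three rates always add up to the same thing as the intercept column, you can move any amount from the slopes into the intercept without changing a single prediction, so the regression has no way to pick one set of coefficients.

```python
def predict(intercept, b_ld, b_gb, b_fb, ld, gb, fb):
    """Linear model with an intercept and all three batted-ball rates."""
    return intercept + b_ld * ld + b_gb * gb + b_fb * fb

# Hypothetical players; each row is (LD, GB, FB) and sums to 1.
rows = [(0.20, 0.45, 0.35), (0.18, 0.50, 0.32), (0.25, 0.40, 0.35)]

# Hypothetical coefficients with no intercept.
coefs_a = (0.0, 0.70, 0.24, 0.14)

# Add 2.0 to the intercept and subtract 2.0 from every slope.
shift = 2.0
coefs_b = (coefs_a[0] + shift,
           coefs_a[1] - shift,
           coefs_a[2] - shift,
           coefs_a[3] - shift)

preds_a = [predict(*coefs_a, *row) for row in rows]
preds_b = [predict(*coefs_b, *row) for row in rows]

# The two very different-looking models predict the same BABIPs.
print(preds_a)
print(preds_b)
```

The second set has a big intercept and all-negative slopes, yet fits exactly as well as the first, which is why the individual coefficients can't be interpreted.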

It also explains why the standard errors are so funny – this is what flagged my memory. Intuitively, we would expect LD%, GB% and FB% to be incredibly important, but your results indicate a very low level of significance (in fact, we would not reject the null hypothesis that they have no impact). Instead, IFFB%, HR/FB and IFH% are very significant, which can make sense, but they shouldn’t be more important than the first three (at least that would be my expectation). The reason is, they don’t suffer from this collinearity problem – their coefficients are effects, but the regression can tease out their effect.

The reason for the funny IFFB% numbers is the way Fangraphs accounts for them – they are a subset of FB%. In other words, if a player has a FB% of 40% and an IFFB% of 20%, it means 20% of that 40% of batted balls are IFFB (8% of all batted balls), and 32% are outfield flyballs. So, when you’re assessing the effect of an IFFB separate from a FB, you have to add the coefficients for both, and you end up with the largest total coefficient, which is intuitive. It’s also why the t-stat is so significant for IFFB% – it doesn’t have the same collinearity problem. If you are not using Fangraphs’ data (or a similar way of counting), I have no idea what’s going on.
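The arithmetic in that example, as a quick sketch. It assumes IFFB% is expressed as a share of fly balls (the Fangraphs-style counting described above); the function name is made up.

```python
def split_fly_balls(fb_rate, iffb_rate):
    """Split FB% into infield and outfield shares of all batted balls.

    Assumes iffb_rate is a share of fly balls, not of all batted balls.
    """
    infield = fb_rate * iffb_rate
    outfield = fb_rate - infield
    return infield, outfield

# FB% of 40% and IFFB% of 20%, as in the example above.
infield, outfield = split_fly_balls(0.40, 0.20)
print(round(infield, 2), round(outfield, 2))
```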

So, the multicollinearity needs to be solved to get meaningful numbers. Luckily, it’s quite easy to fix in this case. You should run the numbers again, but without an intercept. The alternative is to dump one of the batted ball types, but that will make the interpretation funny.

by MjwW on Jan 17, 2012 3:57 PM EST up reply actions  

Third paragraph correction

The reason is, they don’t suffer from this collinearity problem – their coefficients are affected, but the regression can tease out their effect.

by MjwW on Jan 17, 2012 3:59 PM EST up reply actions  

so maybe I should do OFFB%

FB%-IFFB% instead of FB%? I’ll try this.

by Bososx13 on Jan 17, 2012 4:01 PM EST up reply actions  

Okay,

I did OFFB%, and the coefficients look normal. But calculating the xBABIP, all the xBABIPs are negative. It’s weird, and the correlation got much worse.

by Bososx13 on Jan 17, 2012 4:13 PM EST up reply actions  

I made it to the average, now the RMSEs

got a lot worse. Why did this happen? All I did was switch out FB% for OFFB%

by Bososx13 on Jan 17, 2012 4:22 PM EST up reply actions  

If all you did is switch out FB for OFFB%

And kept the intercept, you’ve still got perfect collinearity, so the coefficients will be weird. You have to dump the intercept in this case in order for OLS regression to give good coefficients.

As to why the RMSE got a lot worse – well, it may be that the “stuff” in Slash12’s model has more explanatory power. I recall looking at it previously in passing but I’ve never used it extensively. The problem with your previous results (and possibly your current ones, if you still haven’t dropped the intercept) is that the collinearity problems essentially mean that the coefficients you got are meaningless in terms of interpretation, so you shouldn’t be comparing them to the “old results” anyway.

by MjwW on Jan 17, 2012 4:42 PM EST up reply actions  

Okay, I did it just with the raw batted ball numbers

and divided by contacted balls. The correlation with next year went up to .44, even though I used all players with more than 40 PAs and for the old one, I just used qualified players. I’ll change the link in the original post soon

by Bososx13 on Jan 17, 2012 4:50 PM EST up reply actions  

With your new numbers

Can you also please post the t-stat as well? It just makes it easier to interpret the data.

Thanks

by MjwW on Jan 17, 2012 5:16 PM EST up reply actions  

Sorry

That should be t-stat for each coefficient

by MjwW on Jan 17, 2012 5:16 PM EST up reply actions  

T-stat

GB% 13.84840397
FB% 18.30712049
LD% 19.8746587
IFFB% -10.00614451
IFH% 8.123351989
HR/FB 4.181577012

by Bososx13 on Jan 17, 2012 5:29 PM EST up reply actions  

Yeah, that looks right

Intuitively, it follows what I would expect, in terms of what is most important and what is less variable. This is the first thing I always look at when I run a model: are the error terms significant, and do they match what I’d intuitively expect?

by MjwW on Jan 17, 2012 6:10 PM EST up reply actions  

I did a new test

When my xBABIP disagrees with slash12’s by 20 or more points, mine gets closer 70% of the time.
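Roughly how a head-to-head test like this can be scored, as a sketch only. The function name, the handling of exact ties (skipped), and the sample numbers are all assumptions; 20 points is taken to mean .020 of BABIP.

```python
def closer_share(model_a, model_b, actual, min_gap=0.020):
    """Share of disagreements in which model_a lands closer to actual.

    Only cases where the two models differ by at least min_gap count.
    Assumption: exact ties in error are skipped.
    Returns None if the models never disagree by that much.
    """
    wins = 0
    contests = 0
    for a, b, y in zip(model_a, model_b, actual):
        if abs(a - b) < min_gap:
            continue  # models roughly agree, so there is no contest
        err_a, err_b = abs(a - y), abs(b - y)
        if err_a == err_b:
            continue
        contests += 1
        if err_a < err_b:
            wins += 1
    return wins / contests if contests else None

# Made-up predictions and actual BABIPs, not real players.
model_a = [0.300, 0.310, 0.280, 0.330]
model_b = [0.330, 0.315, 0.310, 0.300]
actual = [0.305, 0.320, 0.305, 0.320]
share = closer_share(model_a, model_b, actual)
print(share)
```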

by Bososx13 on Jan 17, 2012 3:33 PM EST reply actions  

That was for next season

for same season, mine wins 54% of the time.

by Bososx13 on Jan 17, 2012 4:00 PM EST up reply actions  

Ok, the new tests

Next season
Mine: correlation .44, RMSE .045, MAE .033
slash12’s: correlation .22, RMSE .057, MAE .041

Same season
Mine: correlation .59, RMSE .041, MAE .031
slash12’s: correlation .53, RMSE .047, MAE .032

When the two disagree by more than 20 points, my model comes closer 63% of the time when projecting the future and 54% of the time when projecting same-season performance.

by Bososx13 on Jan 17, 2012 7:18 PM EST reply actions  

same years in regression?

What years did you include in your regression? Then what years did you test? If you test against years that are included in your regression, your results will not be reliable.
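One common way to avoid that problem is to hold some seasons out of the regression and test only on those. A minimal sketch, with a hypothetical row layout (a dict with a "season" key) and made-up season choices:

```python
def split_by_season(rows, test_seasons):
    """Separate rows into a fitting set and a held-out testing set."""
    train = [r for r in rows if r["season"] not in test_seasons]
    test = [r for r in rows if r["season"] in test_seasons]
    return train, test

# Hypothetical player-season rows; only the season field matters here.
rows = [{"season": year, "player": "example"} for year in range(2002, 2012)]

# Fit on 2002-2009, then test on seasons the regression never saw.
train, test = split_by_season(rows, test_seasons={2010, 2011})
print(len(train), len(test))
```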

by slash12 on Feb 8, 2012 1:26 PM EST reply actions  
