A new xBABIP calculator
I've been a big fan of The Hardball Times xBABIP calculator for the last six months or so, but there are a couple of things I don't like about it. The first is having to plug in exact counts for ABs, HRs, etc. When dealing with projections, I much prefer to work in percentages. With percentages, it's much easier to see what a player's BABIP should be for a partial season, a span of several years, or a whole career. I'm also not so sure about the inclusion of stolen bases as a statistic.
I'm a big fan of the FanGraphs website, which provides a wide array of batted ball data for each player. I found that BABIP is very strongly determined (as much as BABIP can be, anyway) by a combination of LD%, GB%, FB%, IFFB%, HR/FB%, and IFH%. This is right in line with what The Hardball Times uses, except that I'm dealing strictly with percentages and I've substituted IFH% for SBs. It's worth noting that I'm not taking ballpark factors into account, and they surely have some effect on BABIP as well.
I came up with my numbers by plotting a large amount of data (three years' worth of individual player statistics) and running a multi-variable regression analysis on it. (I'm not sure that's the right wording; I have no formal training in statistical analysis, just some stuff I've picked up.)
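For readers unfamiliar with the technique, here is a minimal sketch of what a multi-variable regression does, using ordinary least squares solved through the normal equations. The post does not say which tool or settings were used, so this is only an illustration: the function names (`solve_linear_system`, `fit_least_squares`) are hypothetical, and the toy inputs are made up and generated from a known formula so the fit can be checked. They are not real player data.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the inputs are left untouched.
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        # Swap in the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back-substitution on the upper-triangular system.
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - tail) / aug[r][r]
    return solution


def fit_least_squares(rows, targets):
    """Ordinary least squares; returns [intercept, coef_1, coef_2, ...]."""
    design = [[1.0] + list(row) for row in rows]
    k = len(design[0])
    # Normal equations: (X^T X) beta = X^T y.
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Made-up toy inputs: (LD, GB) as fractions. NOT real player data.
toy_rows = [(0.18, 0.45), (0.20, 0.40), (0.22, 0.50), (0.19, 0.35), (0.21, 0.42)]
# Targets come from a known made-up formula, so the fit should recover it.
toy_targets = [0.35 + 0.30 * ld - 0.15 * gb for ld, gb in toy_rows]

intercept, ld_coef, gb_coef = fit_least_squares(toy_rows, toy_targets)
print(round(intercept, 4), round(ld_coef, 4), round(gb_coef, 4))
```

With real data the targets would be each player's actual BABIP, and the fitted coefficients would only approximate the relationship rather than recover it exactly.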
Here's the equation I came up with:
xBABIP = 0.391597252
    + (LD% * 0.287709436)
    + ((GB% - (GB% * IFH%)) * -0.151969035)
    + ((FB% - (FB% * HR/FB%) - (FB% * IFFB%)) * -0.187532776)
    + ((IFFB% * FB%) * -0.834512464)
    + ((IFH% * GB%) * 0.4997192)
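The equation can be sketched as a small Python function. The coefficients are copied from the post; the function name `xbabip`, the constant names, and the choice to accept inputs in percent (19.5 meaning 19.5%) and convert them to fractions are assumptions made for illustration. The fraction interpretation is the one that reproduces the .319 figure quoted for Soto later in the comments.

```python
# Coefficients copied from the equation in the post.
INTERCEPT = 0.391597252
LD_COEF = 0.287709436
GB_COEF = -0.151969035    # applied to ground balls that are not infield hits
FB_COEF = -0.187532776    # applied to fly balls that are neither HR nor infield flies
IFFB_COEF = -0.834512464  # applied to infield fly balls
IFH_COEF = 0.4997192      # applied to infield hits


def xbabip(ld, gb, fb, iffb, hr_fb, ifh):
    """Expected BABIP from batted-ball rates given in percent (19.5 means 19.5%)."""
    # Convert percent inputs to fractions before applying the coefficients.
    ld, gb, fb, iffb, hr_fb, ifh = (
        value / 100.0 for value in (ld, gb, fb, iffb, hr_fb, ifh)
    )
    return (
        INTERCEPT
        + ld * LD_COEF
        + (gb - gb * ifh) * GB_COEF
        + (fb - fb * hr_fb - fb * iffb) * FB_COEF
        + (iffb * fb) * IFFB_COEF
        + (ifh * gb) * IFH_COEF
    )


# Inputs quoted for Soto in the comment thread.
soto = xbabip(ld=20.25, gb=37.7, fb=42.05, iffb=7.65, hr_fb=14.7, ifh=5.97)
print(round(soto, 3))  # 0.319
```

Because everything is a rate, the same call works for a partial season, a full season, or a career line.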
Here's a published view of a spreadsheet showing it in action:
http://spreadsheets.google.com/ccc?key=0AuaVTUnZda7fdFVpY2NoRC1zS1p0UlNPaDlVdlRhN1E&hl=en
Here's a download of the spreadsheet in OpenOffice format (forgive the lame hosting service; I wasn't sure where to upload it):
http://www.filefactory.com/file/a1a2d5a/n/public_xBABIP_Calculator_ods
I've been using this calculator (along with a number of other equations) to build my own projections for 2010, and here are a few of the interesting things I've noticed.
First off, LD% has a very strong correlation to BABIP (not exactly a revolutionary statement), but it also seems very hard to project. There seems to be a lot of luck built into it, so even a career LD% still has some luck baked in, and I tend to nudge it closer to the league average (19.5%).
GB% is a little easier to predict. A higher GB% tends to yield a higher BABIP, but that depends on your IFH% as well. A player who can post a high IFH% with a lot of ground balls will greatly increase his BABIP, while a slow player with a terrible IFH% and a lot of ground balls won't increase his BABIP nearly as much (makes sense).
FB% is again typically easier to predict than LD%, and a high FB% tends to yield a lower BABIP, since fly balls are more likely to be outs. But you've got to look at HR/FB and IFFB% as well to get an accurate picture. A player who hits a ton of fly balls but has a very high HR/FB rate and a very low IFFB% (Ryan Howard) can post a more respectable BABIP (fly balls have a better shot of landing if they are getting out of the infield).
HR/FB is also a little easier to predict, and it doesn't directly affect your BABIP; it's only used to take the home runs out of your fly balls (which in turn helps your BABIP). One thing that strikes me as problematic here is line drive home runs.
IFFB% seems somewhat player-controlled, but it also has a large luck component from year to year (probably largely due to sample size). It has a definite impact on your BABIP, as fly balls on the infield are essentially automatic outs.
IFH% seems very speed-dependent. The more infield hits you have, the higher your BABIP. This can vary from year to year with luck, but speedy players will generally post better numbers (there are a few notable exceptions, like Jason Bay's abnormally high IFH%, which I chalk up to some luck). Ballpark factors surely play a role here as well (and I'm not accounting for them).
So in the end, what we get is a way to take numbers directly from FanGraphs (over the course of a career, a full season, or even a partial season) and get a decent idea of what a player's BABIP should be and how lucky he has been. As always, BABIP will still vary a lot from year to year (and BA, OBP, and SLG along with it), but this is an attempt to estimate the middle number that a given player's BABIP fluctuates around. Outside of using a calculator like this one or The Hardball Times', the next best way to evaluate BABIP is probably to look at a player's career numbers, but even those are heavily prone to being skewed by lucky streaks.
I'm very interested in any feedback/critique that anyone has to offer, or any ideas on improving it. I've also got a number of other calculators (they cover batting average, xHR, xR, xRBI, xSB, xAvg, xOBP, and xSLG) that I'd be willing to throw out there as well, but before going to the trouble, I figured I'd see what kind of buzz I get from this one.
Comments
works for pitchers too
Another big advantage of this method of calculating xBABIP is that it works easily for pitchers as well.
by slash12 on Nov 12, 2009 1:17 PM EST
I'm surprised
That nobody’s commented on this yet. I’ve hesitated to say anything because I don’t find myself qualified to comment on the mathematical gymnastics involved, so I can’t offer any true critique that’d help you out.
The one thing that I would be VERY interested to see is how well this compares to Bendix’s method. Perhaps you could run a study on the two?
by Anticon23 on Nov 12, 2009 4:44 PM EST
Looks fine
I’m not sure a multivariate regression is the right approach; you should at least do some individual tailoring. However, you can’t really go wrong.
Like Anticon, I would like to see this compared with the Bendix method.
by vivaelpujols on Nov 12, 2009 5:40 PM EST
This looks similar to an article I wrote at StatSpeak before I joined BP. We clearly came to a lot of the same conclusions separately, which reinforces that both of our ideas are probably right.
MVN had some trouble and now the old articles’ formatting is all messed up, but you can read through it if you like.
http://statspeak.net/2009/05/10/improving-babip-projection-by-batted-ball-types.html
The key additions are to look at BABIP on individual batted ball types. BABIP on ground balls and fly balls is more persistent than on line drives. BABIP on line drives is correlated with power, but at a declining rate, so it’s best to model HR/AB as a logarithmic term. I also included historical ROE/GB, so it’s just (IFH+ROE)/GB, to avoid the problem with scoring issues. Additionally, contact rate is correlated with BABIP on ground balls. So I ran a regression of BABIP on LD%, GB%, GB-BABIP, IFFB%, OFFB-BABIP, LN(HR/AB), LN(contact), and (IFH+ROE)/GB.
The model predicts BABIP better than xBABIP and the other projection systems that I’ve tested it against, including when looking only at 2009, which obviously wasn’t included in the data when I published this in May. It’s probably a good starting point if you want to predict BABIP in the future, rather than do what xBABIP does, which is look at things retrospectively and approximate what BABIP should have been. The thing to remember is that hitters differ strongly in their BABIP by batted ball type.
by Matt Swartz on Nov 12, 2009 9:10 PM EST
Both of the other methods look very interesting, but they require a lot of data that I don’t have on players (and I’m not sure how to get). The method I developed can very quickly determine an expected BABIP from stats that are readily available via FanGraphs, for both batters and pitchers. I’ve been playing with it a lot while doing my 2010 projections, and I’ve been fairly happy with its results. There are a few outliers that I’ve noticed, though, that are probably worth mentioning:
Matt Kemp: my system projects him with a .347 BABIP, which is well below the numbers he’s been posting. Is this a flaw in my system, or is Kemp just getting lucky?
Ryan Howard: my system projects him for a .349 BABIP (due to his high HR/FB rate, extremely low IFFB rate, and high LD rate). Has Howard been unlucky with balls in play, or is this a flaw?
Nyjer Morgan: my system projects him for a .328 BABIP (low LD%, and a relatively low IFH% compared to other speedsters). Has Nyjer been lucky thus far in his career, or is this a flaw?
Brandon Phillips: my system projects him for a .310 BABIP (lots of ground balls and a decent IFH%). Has he been unlucky in recent years, or is this another mistake?
Besides these anomalies (or are they?), the other thing I’ve noticed is that IFH%, IFFB%, and LD% all vary greatly from year to year themselves (via luck), so even after breaking BABIP out into this batted ball data, it’s still difficult to project accurately. It’s still better than just looking at BABIP itself, though. You can reason along the lines of “this guy has really good speed and bats left-handed, so I tend to believe more in his higher IFH%,” and tweak accordingly.
by slash12 on Nov 13, 2009 9:58 AM EST
What's with the bizarre number of decimals?
I don’t think you really need to find projected batting average to the thousandths of a point.
Linda's in the cold ground, won't see her anymore
Somewhere out on the highway tonight, the drunken engines roar
It's just one of those things, one of those things
-- Al Stewart, "Accident on 3rd St."
In memory of Nick Adenhart and all victims of drunk driving
by PaulThomas on Nov 13, 2009 4:00 PM EST
comparing to the bendix method
I found the following post: http://www.hardballtimes.com/main/article/batters-and-babip/
Comparing my results to theirs, they get an R-squared of around 35%, while my results give me an R-squared of 44%. So it would appear that basing BABIP on these stats does indeed do a pretty good job.
Of course, the nice thing about their method is that it’s based on more skill-based stats, while LD%, IFH%, and IFFB% are still fairly luck-based statistics (and vary a lot from year to year themselves).
But this method should be an interesting way to project, at a glance, a batter’s or pitcher’s expected BABIP.
by slash12 on Nov 17, 2009 9:01 AM EST
I assume you are testing your method against Year N+1?
by vivaelpujols on Nov 17, 2009 11:36 PM EST
You are testing xBABIP...
Based on how 2008’s numbers project 2009’s. That’s the only way to do it.
by vivaelpujols on Nov 18, 2009 10:35 PM EST
Nick is right
Can’t compare it to the stat you are trying to correct
by Zach Sanders on Nov 21, 2009 6:35 PM EST
comparing to other methods
And another article: http://www.hardballtimes.com/main/fantasy/article/whats-the-best-babip-estimator/
At 44%, this method again shows up as the most accurate (the disclaimer, again, is that it’s hard to project what a person’s true LD%, IFFB%, and IFH% are).
by slash12 on Nov 17, 2009 9:08 AM EST
Year N+1 Results
I ran this comparison, and here’s what I found:
First, I compared the difference between a batter’s 2008 BABIP and his 2009 BABIP against the difference between his 2008 xBABIP and his 2009 BABIP.
51% of the time, the xBABIP was closer than the actual BABIP was.
Next, ignoring the size of the difference, I totaled up the number of times that xBABIP correctly predicted that a batter’s BABIP would rise or fall the following year.
65% of the time, the xBABIP correctly predicted a rise/fall the following year.
I used a sample of about 100 players who had a minimum of 500 plate appearances in both 2008 and 2009.
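The two checks described here can be sketched roughly as follows. This is only an illustration of the comparison, not the spreadsheet actually used: the function name `year_n1_summary` is hypothetical, and the four sample rows are made-up toy numbers rather than real 2008/2009 player data.

```python
def year_n1_summary(players):
    """players: list of (babip_n, xbabip_n, babip_n1) tuples.

    Returns (share where xBABIP was closer to next year's BABIP than this
    year's BABIP was, share where xBABIP called the rise/fall correctly).
    """
    closer = 0
    direction_hits = 0
    for babip_n, xbabip_n, babip_n1 in players:
        # Check 1: was xBABIP nearer to next year's BABIP than last year's BABIP?
        if abs(xbabip_n - babip_n1) < abs(babip_n - babip_n1):
            closer += 1
        # Check 2: did xBABIP sit on the same side of BABIP as next year's result?
        predicted_rise = xbabip_n > babip_n
        actual_rise = babip_n1 > babip_n
        if predicted_rise == actual_rise:
            direction_hits += 1
    total = len(players)
    return closer / total, direction_hits / total


# Made-up toy rows (BABIP year N, xBABIP year N, BABIP year N+1). NOT real data.
toy_players = [
    (0.300, 0.310, 0.320),
    (0.340, 0.320, 0.345),
    (0.280, 0.300, 0.295),
    (0.310, 0.305, 0.290),
]
closer_share, direction_share = year_n1_summary(toy_players)
print(closer_share, direction_share)  # 0.75 0.75
```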
So what does this mean? Well, I do feel that the batted ball data I’ve identified has a very strong correlation with a player’s BABIP. Unfortunately, that same batted ball data varies greatly from year to year, making those stats themselves very difficult to predict. So what good does this do?
Well, it lets us break BABIP up into something that makes a little more sense to us. Let me explain:
LD%: Based on my research, more often than not, a player’s career LD% fluctuates around the 19.5% mark. So if the batter you are looking at has an LD% over or under that, it seems to imply that he has been a little lucky/unlucky with his fly balls. You can regress that number closer to the 19.5% mean, and odds are you’ll have a better guess at his LD% for the next year.
GB%/FB%: In most cases these remain fairly consistent over the course of a player’s career. I’ve found these are easier to predict accurately, with a few noticeable exceptions (sometimes, strangely, a player will flip-flop and become a fly ball hitter one year).
IFFB%: Players with better power tend to post better (lower) IFFB% over the course of their careers. Overall, this stat tends to fluctuate a lot, but if you notice an IFFB% that stands out from the rest of a player’s career, or deviates greatly from the league average (8.9%), then you can regress it toward the mean.
IFH%: Better for left-handed hitters, and much better for speedy players. You can rule out a fluky low IFH% and regress it toward a career average, or toward the mean (5.97% league average).
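The regress-toward-the-mean step above can be sketched like this. The league averages are the ones quoted in this comment; the function name `regress_toward_mean` and the `weight` parameter are hypothetical, and the default of 0.5 is an arbitrary placeholder, since the comment does not say how far to regress.

```python
# League averages quoted in the comment, in percent.
LEAGUE_AVERAGE = {"LD%": 19.5, "IFFB%": 8.9, "IFH%": 5.97}


def regress_toward_mean(observed, mean, weight=0.5):
    """Move an observed rate toward a mean.

    weight=0 keeps the observed value; weight=1 returns the mean.
    The 0.5 default is a placeholder, not a value taken from the post.
    """
    return observed + weight * (mean - observed)


# A hypothetical hitter with a 23.5% line drive rate, regressed halfway.
print(regress_toward_mean(23.5, LEAGUE_AVERAGE["LD%"]))  # 21.5
```

The same helper works with a career average in place of the league mean, which matches the suggestion for IFH%.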
by slash12 on Nov 23, 2009 3:55 PM EST
What was the year n+1 correlation?
by vivaelpujols on Nov 23, 2009 6:46 PM EST
That seems really low
are you sure?
by vivaelpujols on Nov 24, 2009 12:36 PM EST
not really a relevant analysis
The more I think about it, the more I think it’s pointless to do a year N+1 analysis. This isn’t a way to predict BABIP for next year; it’s a way to predict BABIP using batted ball data. I see this mostly as a projection tool, not a definitive “this is going to be your BABIP next year” tool, though a smart person can use it to create one. The main reason is that your batted ball data itself is going to vary greatly from year to year via luck (though there is some skill element to it as well, as certain high-LD%, high-IFH%, low-IFFB% batters are evidence of).
Now that said, I do think it’s a very useful exercise for anyone doing projections (myself included) to take their projected statistics and do the year N+1 comparison. But for me, the input isn’t the person’s Year N xBABIP; it’s an xBABIP based on a different set of batted ball data that I generated myself by regressing toward a combination of the major league average and the career average. I’ve begun to do so with a few players’ projections, and thus far I’ve been pleased with the results, but by no means do I have a sample large enough to run any definitive statistics yet.
by slash12 on Dec 1, 2009 12:52 PM EST
Think about it this way
A player’s BABIP in one year is made up of true skill + random variation (which includes luck). A player’s BABIP in Year N+1 is also true skill + random variation. So if you have a model that attempts to measure a player’s true BABIP skill in Year N, you need to test it by seeing how well it predicts BABIP in Year N+1, because you are projecting the player’s BABIP skill.
That’s the only way to test it. You can’t test it on the sample you fitted your equation to (you just did a multivariate regression, which will inherently be the best fit to the data you ran it on), so in order to see how well your model actually measures skill, you have to test Year N+1.
by vivaelpujols on Dec 1, 2009 7:40 PM EST
that sounds logical....but...
What I’ve developed isn’t a measure of a player’s true BABIP skill. It’s a way of breaking somebody’s BABIP up into statistics that make more sense (and thus can themselves be projected a little more logically).
For instance, just looking at somebody with a high BABIP doesn’t tell you a lot, but if you drill into his batted ball statistics, you can identify things like “OK, part of his high BABIP is that he’s got a high IFH% and hits a lot of ground balls, and he’s fast, so that makes sense,” or “He drives the ball well and has a high line drive percentage and a low IFFB%, so that makes sense.” Using that kind of logic, you can also identify flukes in a person’s BABIP, such as “OK, that guy is not fast at all, yet he had a 10.1% IFH%?? That screams fluke, so I should regress that statistic downward,” and using this equation you can see how that will affect his BABIP.
by slash12 on Dec 2, 2009 10:32 AM EST
Another exercise
Really quick, I took 3 players and, ignoring the existence of 2009, used the ideas I outlined in my last post along with my xBABIP calculator. My results are below.
Soto: 14.7 HR/FB, 20.25 LD%, 37.7 GB%, 42.05 FB%, 7.65 IFFB%, 5.97 IFH%. Predicts: .319 BABIP
Actual 2009: .251
Comments: I predicted it would go down (most would have), and it did, but it did so by a LOT! Drilling into the 2009 numbers, we can see why: 18.1% LD%, 4.8% IFH%. His line drive rate went down, as I expected, but by much more. And his IFH% went down even more than I predicted, though I can’t say this is really all that unexpected, given his speed.
Mauer: 8.5 HR/FB, 21.5 LD%, 49.2 GB%, 29.3 FB%, 2.4 IFFB%, 3.6 IFH%. Predicts: .335 BABIP
Actual 2009: .377
Comments: I incorrectly predicted it would go down. I thought his LD% would regress further than it did, and I couldn’t have predicted his spike in HR/FB% (which helps your BABIP, because that means fewer fly balls in play). His IFFB% also went lower than I would have predicted.
Votto: 18.5 HR/FB, 22.35 LD%, 44.1 GB%, 33.55 FB%, 3.8 IFFB%, 4.34 IFH%. Predicts: .342 BABIP
Actual 2009: .373
Comments: His IFH% did indeed go up, as I expected it would, while his IFFB% improved to 1.4%, which contributed to a big increase in his actual 2009 BABIP over my predicted one.
This is a very small sample, but it gives you an idea of how people (even people smarter than I) could leverage the ability to predict BABIP from batted ball data.
by slash12 on Nov 23, 2009 4:24 PM EST
Correlation calculations corrected
OK, after learning a little more about how to do correlations correctly, I came up with the following results:
Comparing 2008 BABIP to 2009 BABIP: .35 correlation
Comparing 2008 xBABIP to 2009 BABIP: .48 correlation
It turns out that this is indeed a better method of predicting BABIP than simply using last year’s BABIP. Of course, in the grand scheme of things, .48 still isn’t that great. What I’m most interested in discovering at this point is how to better predict what a person’s IFFB% and LD% are going to be (based on skills), since those things vary so much from year to year and have such a large impact on BABIP. If I can figure this out, I’ll have a much better idea. Better still would be to introduce spray data into the equation (but I can’t figure out how to get spray data).
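For anyone wanting to repeat the comparison, a year-to-year correlation like the ones above can be sketched with a plain Pearson correlation coefficient. The comment does not say which correlation measure or tool was used, so Pearson's r is an assumption here, the function name `pearson_r` is hypothetical, and the sample lists are made-up toy numbers, not the 2008/2009 data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Made-up toy values: year N xBABIP and year N+1 BABIP. NOT real data.
toy_xbabip_n = [0.310, 0.295, 0.330, 0.305, 0.320]
toy_babip_n1 = [0.315, 0.280, 0.325, 0.320, 0.300]
print(round(pearson_r(toy_xbabip_n, toy_babip_n1), 2))
```

Running the same function once with year N BABIP and once with year N xBABIP against year N+1 BABIP gives the two numbers being compared.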
by slash12 on Dec 8, 2009 10:11 AM EST
Another exercise
Using this formula and the 2008/2009 data, I adjusted the 2008 IFFB%, IFH%, and LD% toward the mean and used the true 2008 GB% and FB%. Doing this got me up to a .57 correlation. Analyzing the results further:
65% of the predicted BABIP’s were within 30 points of the actual
55% were within 20 points
25% were within 10 points
14% were within 5 points
and 80% of the time, the xBABIP predicted correctly whether the BABIP would rise or fall the next year.
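The "within N points" breakdown above can be sketched as follows, treating one point as .001 of BABIP. The function name `share_within` is hypothetical and the sample values are made-up toy numbers, not the real 2008/2009 results.

```python
def share_within(predicted, actual, points):
    """Share of predictions within `points` of the actual value (1 point = .001)."""
    hits = 0
    for p, a in zip(predicted, actual):
        # Round the miss to whole points so floating-point noise cannot
        # push an exact 10-point miss just over the threshold.
        miss_in_points = round(abs(p - a) * 1000)
        if miss_in_points <= points:
            hits += 1
    return hits / len(predicted)


# Made-up toy values; the misses are 4, 25, 12, and 1 points. NOT real data.
toy_predicted = [0.300, 0.320, 0.310, 0.290]
toy_actual = [0.304, 0.345, 0.322, 0.291]

for threshold in (30, 20, 10, 5):
    print(threshold, share_within(toy_predicted, toy_actual, threshold))
```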
by slash12 on Dec 8, 2009 3:25 PM EST
