What can be learned from batted ball data?

Batted ball data. So hot right now. Batted ball data. But does it tell us anything we don't already know?

Jason O. Watson/Getty Images

Batted ball velocity data has been very fashionable lately. It seems like a day doesn't go by without a tweet like this going out:

If you're like me, your response is something along these lines:

With the prospect of comprehensive StatCast data possibly arriving sometime this season, quality of contact is on everyone's minds. Whether it's the nascent StatCast data or the metrics that were already available, there's a lot of conversation around which hitters are making hard contact and which aren't (and which pitchers are generating soft contact and which aren't). That said, there's not much of a consensus around what that means, either directly (what exactly is being measured?) or indirectly (how much does what's being measured matter?). There have been a few attempts at answering the first question. Most recently, Jeff Sullivan wrote an article over at FanGraphs showing both that the exit velocity data is not always accurate and that even when it is accurate, the contact generating it doesn't always look like we'd expect it to.

I'm going to try to answer the second question - how much does this matter? Now that all this batted ball data is easily accessible, what conclusions can be drawn with it? Does it contribute meaningfully to our understanding of different statistics? I'm going to create some "baseline" models that aim to explain a player's BABIP using only data other than quality of contact. Then, I'll incorporate the various sources of batted ball data into the model to see if it improves and by how much.

BABIP makes a good dependent variable for a couple of reasons. First, it's the most intuitive application of quality of contact or batted ball velocity data. Sure, a .360 BABIP seems unsustainably high, but if the player is in the top tier of quality of contact, he seems more likely to sustain it going forward. Second, BABIP is, in the aggregate, fairly well explained by a simple regression. August Fagerstrom wrote an article for FanGraphs that regressed BABIP for players with more than 1,000 ABs since 2002 on speed, IFFB%, and LD% and got an R-squared over .50. That's pretty good! There's definitely less noise in the BABIPs in that sample than in a single season, but it still shows that you can get a meaningful model without too much effort.

The regressions created in this article will probably end up looking fairly similar to that, with a couple of differences. First, I'm going to look only at a single year, so I can compare across some of the quality-of-contact metrics. Second, I'm not really too concerned with the model's performance. Obviously, if the base model has an R-squared of literally 0, that might be a problem, but I'm interested in the improvement from adding the quality of contact data, not the actual explanatory power.

There are several different sources for data on the authority with which a player hits a ball. I'm going to cover what strike me as the major ones - BIS, Inside Edge, and StatCast - with a brief explanation of each one and an evaluation of how the model changes with that data. The non-contact data (BABIP, batted ball classifications, etc.) all come from FanGraphs.

Every base model regresses player BABIP on four variables - speed score (SPD), GB%, LD%, and "true" FB%, i.e. fly balls excluding infield fly balls. This implicitly uses IFFB% as well: ground balls, line drives, true fly balls, and infield flies always add up to 100% of batted balls, so if all four were included, the regression would have to omit one of them anyway. These seem to be the obvious explanatory factors of BABIP and should do a fairly good job of setting the baseline.
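To make the "true" FB% idea concrete, here is a minimal Python sketch. It assumes the FanGraphs convention that IFFB% is measured per fly ball, so true FB% is FB% times (1 - IFFB%). The function names and the sample batted ball mix are hypothetical, made up for illustration rather than taken from the data behind this article.

```python
def true_fb_pct(fb_pct: float, iffb_pct: float) -> float:
    """Fly balls excluding infield flies, as a share of all batted balls.

    Assumes iffb_pct is infield flies per fly ball, in percent.
    """
    return fb_pct * (1.0 - iffb_pct / 100.0)


def iffb_share(fb_pct: float, iffb_pct: float) -> float:
    """Infield flies as a share of all batted balls, in percent."""
    return fb_pct * iffb_pct / 100.0


# Hypothetical batted ball mix (not a real player); GB% + LD% + FB% = 100.
gb, ld, fb, iffb = 45.0, 20.0, 35.0, 10.0

# The four shares that together cover every batted ball.
shares = [gb, ld, true_fb_pct(fb, iffb), iffb_share(fb, iffb)]
print("shares:", shares)
print("total:", sum(shares))
```

Because the four shares always total 100, any one of them is fully determined by the other three, which is why a regression can carry only three of them alongside an intercept.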

Baseball Info Solutions/FanGraphs

The first source for quality of contact is Baseball Info Solutions, newly available at FanGraphs. Each player has a Soft%, a Med%, and a Hard%, which I hope I don't have to explain. Here's what David Appelman wrote about their addition:

It's important to know that these are all relative to the batted ball type. For instance, a fly ball might be classified as hard, but if that ball were a line drive, it could potentially be classified as medium. If you are interested in seeing how line drives/fly balls/groundballs are classified into soft/medium/hard, we have that information available on the splits pages.

It's also important to know that prior to 2010, these were all graded visually. From 2010 onward, the batted ball type, hangtime, and distance hit are all used to calculate the soft/medium/hard classifications.

These are definitely a little opaque, and without knowing exactly what the thresholds are for each one, it's hard to know how much to trust them, so let's try to find out. First, the baseline model, considering the 146 qualified batters from 2014. Just using GB/LD/FB% and SPD, the basic regression has an adjusted R-squared of .3012, indicating that in 2014, batted ball mix and speed explained about 30% of the variance in BABIP among players.
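For readers who want the mechanics behind "adjusted" R-squared, the sketch below uses the standard textbook formula for a model with an intercept: it shrinks the raw R-squared according to the number of predictors relative to the sample size. The article does not say which software produced the figures, so the function names here are hypothetical and the back-calculated raw R-squared is only what that formula would imply.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R-squared for n observations and p predictors (plus intercept)."""
    if n - p - 1 <= 0:
        raise ValueError("need more observations than parameters")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


def raw_r2_from_adjusted(adj_r2: float, n: int, p: int) -> float:
    """Invert the formula above to recover the raw R-squared."""
    if n - p - 1 <= 0:
        raise ValueError("need more observations than parameters")
    return 1.0 - (1.0 - adj_r2) * (n - p - 1) / (n - 1)


# Figures from the text: 146 qualified batters, four predictors
# (SPD, GB%, LD%, true FB%), adjusted R-squared of .3012.
implied_raw = raw_r2_from_adjusted(0.3012, n=146, p=4)
print(f"implied raw R-squared: {implied_raw:.4f}")
```

With 146 players and only four predictors the penalty is small, so the raw and adjusted figures sit close together. The gap grows quickly as the sample shrinks, which matters for the smaller samples later on.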

So, that's without quality of contact data. What does the BIS data add? First, I included both Med% and Hard% in the regression, and the adjusted R-squared increased to .4041, about 34% more explanatory power than the base model in relative terms (roughly 10 more percentage points of variance explained). That's a very substantial improvement, and a good start! If only Hard% is included in the regression, the R-squared barely falls, to .4009.
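The size of that improvement can be checked with simple arithmetic. The sketch below takes the three adjusted R-squared values quoted for the BIS comparison and expresses each gain both in absolute terms and relative to the base model. The helper name is hypothetical.

```python
def relative_gain(base: float, extended: float) -> float:
    """Gain of the extended model over the base model, as a fraction of the base."""
    return (extended - base) / base


# Adjusted R-squared values quoted in the text (2014, 146 qualified batters).
base_adj_r2 = 0.3012      # SPD, GB%, LD%, true FB%
with_med_hard = 0.4041    # base + Med% + Hard%
with_hard_only = 0.4009   # base + Hard%

for label, value in [("Med% + Hard%", with_med_hard), ("Hard% only", with_hard_only)]:
    absolute = value - base_adj_r2
    relative = relative_gain(base_adj_r2, value)
    print(f"{label}: +{absolute:.4f} absolute, {relative:.1%} relative")
```

Reporting both numbers is a deliberate choice: the relative figure sounds larger, while the absolute figure is the share of BABIP variance actually added.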

So the BIS data appear to be helpful for explaining BABIP. Just by adding in the rate of hard contact, the explanatory power goes up a substantial amount. Hard% does appear to overlap with some of the batted ball classifications: while Hard% is significant in the model, GB% and FB% lose much of their significance once it is included. Regardless, this is a very good result for the usefulness of the BIS data and indicates that something meaningful is likely being tracked.

Inside Edge

Mark Simon, of ESPN Stats and Info, sends out tweets that look something like this every once in a while:

Those numbers are interesting and fun to look at, but it's hard to tell if they mean anything more than that. First, while he sometimes shows how all the teams are doing, I've never seen a full list of every player, and without knowing the gap between the best and the worst, it's difficult to know what to make of it. There's basically only an ordinal ranking and a little bit of quantitative info. Secondly, there's the same opacity of data and measurement that's present with the FanGraphs data. I believe the source is Inside Edge, and they're subjective categorizations, similar to Inside Edge's defensive data, but that could be completely wrong.

For a source, I'm using the above tweet, which had the top 61 batters in Hard% of 2014. Not all those hitters got enough ABs to qualify, so our sample ends up being 45 players, also known as comically small. Again, this gets at some of the limitations of the data, but I'm going to ignore some rules and go ahead like this is actually a substantial sample.

Regressing the BABIP of those 45 players on SPD, GB%, LD%, and FB% spits out an R-squared of .2546. Unsurprisingly, the smaller sample size reduces the explanatory power of this model substantially. Adding in the Hard% from Simon's tweet increases the R-squared to .2607, or basically not at all.

This would seem to indicate that this data is almost useless in its current form, at least as an analytical tool rather than a fun fact. It's possible that if it were publicly available for all players, the data would serve a valuable purpose, but this sample is both too small and too unrepresentative. Because these are all the league leaders, there just isn't much spread or variation, and that keeps these snippets firmly in the realm of "interesting" rather than "analytically useful".


StatCast

Finally, we have average batted ball velocity, as recorded by StatCast and made available at Baseball Savant. This data, like the Inside Edge data, is incomplete, and it's also completely new. It's not clear yet how accurate it is, or if there are calibration issues between parks like there sometimes are with PITCHf/x. But it exists, in some form, so we might as well try using it, and see what (if anything) it adds.

I'll take all players with at least 100 ABs through Sunday's games (174 players) who also have had at least 10 ABs recorded by StatCast (down to 131 players). In case you weren't already skeptical of this data, 43 players with more than 100 ABs had fewer than 10 StatCast ABs, but the lowest number of StatCast ABs among the remaining 131 is 34. Because of the nature of the rollout, I'm guessing certain teams are being recorded far more often than others, so this is probably not a representative sample.

With less than 20% of the season done, there's still a ton of noise in BABIP, but again, I'm not concerned so much with the explanatory power of the model as with the added explanatory power of StatCast. If we run the base regression, BABIP on FB/LD/GB% and SPD, we get an adjusted R-squared of .1946. Again, the lower R-squared is expected; there's a lot of noise in BABIP this early in the season. Adding in the StatCast data increases the R-squared, but only barely, to .2035, an increase of less than .01. Because that was so disappointing, I tried upping the threshold to 50 StatCast ABs, which 94 players have had. The adjusted R-squared of the base model on this set was .2123, and the R-squared of the model using StatCast data was .2247, an increase of only about .012. In other words, next to nothing.
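For a quick side-by-side of the two StatCast thresholds, the sketch below subtracts each base-model R-squared from its StatCast counterpart. The values are the ones quoted in the paragraph above; the dictionary labels and helper name are hypothetical.

```python
# R-squared values quoted in the text: (base model, base model + StatCast velocity).
statcast_fits = {
    "10+ StatCast ABs (131 players)": (0.1946, 0.2035),
    "50+ StatCast ABs (94 players)": (0.2123, 0.2247),
}


def absolute_gain(base: float, extended: float) -> float:
    """Raw difference in R-squared between the extended and base models."""
    return extended - base


gains = {label: absolute_gain(base, ext) for label, (base, ext) in statcast_fits.items()}

for label, gain in gains.items():
    print(f"{label}: {gain:+.4f}")
```

Either way, the added explanatory power is on the order of one percentage point of variance, far short of what the BIS Hard% added in the 2014 sample.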

As with the Inside Edge data, these results suggest that the StatCast data is mostly useless as a serious tool of analysis right now. Unlike the Simon data, there's a reasonable expectation of a higher quantity and quality of data in the near future, so don't consider this a referendum on StatCast as a whole. But given what we have right now, there doesn't appear to be much it can tell us that already-existing stats can't.


What can be taken from all this? The BIS data from 2014 conveyed useful information about BABIP and will probably do the same for 2015. Identifying players over- or underperforming their expected BABIP based on their speed and hard-hit rate seems like it wouldn't be statistically irresponsible, though there are obviously unanswered questions of sample size. The same cannot be said about the limited glimpses of Inside Edge data - without a comprehensive, league-wide dataset, the rates just don't convey enough information to have any analytical power. Finally, the StatCast data (unsurprisingly) does not appear comprehensive or meaningful enough to draw any conclusions.

[Plot: linear weights by batted ball velocity, via Alan Nathan]

It's possible that a different model might add even more explanatory power. Alan Nathan (@POBGuy) tweeted this fascinating plot of the linear weights of the different batted ball velocities, showing that medium-speed balls are actually worse than softly-hit balls, since a speedy batter can sometimes beat a squibber out for a hit, while the same isn't true of a firm grounder. While three pieces of contact at 80MPH have the same average as three at 50MPH, 80MPH, and 110MPH, the second group would probably generate more hits. For that level of analysis, though, the current StatCast data is woefully inadequate. That may change in the near future, but until then, I suspect simple models will have to suffice.
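The point about averages hiding the shape of the distribution is easy to verify. This small sketch compares the two hypothetical three-ball samples from the paragraph above, using only Python's standard library; the variable names are made up for illustration.

```python
from statistics import mean, pstdev

# Exit velocities in MPH for the two hypothetical samples described in the text.
uniform_contact = [80, 80, 80]
mixed_contact = [50, 80, 110]

for label, sample in [("uniform", uniform_contact), ("mixed", mixed_contact)]:
    # pstdev treats the three balls as the whole population, not a sample of one.
    print(f"{label}: mean={mean(sample)} MPH, spread={pstdev(sample):.1f} MPH")
```

Both samples report the same average velocity, so a model fed only the average cannot tell them apart. Capturing the difference would take ball-by-ball data, or at least some measure of spread.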

So that's where we seem to be at right now. The BIS data on FanGraphs is useful, but don't despair if your favorite player isn't on a Mark Simon tweet or is at the bottom of the StatCast leaderboard, and don't try to use their average hit velocity to explain a low BABIP. Until much more granular data is available, or the processes behind the qualitative data are more clearly explained somewhere, the explanatory power of this data is limited at best.

. . .

All statistics courtesy of FanGraphs, ESPN Stats & Info, and Baseball Savant.

Henry Druschel is a Contributor at Beyond the Box Score. You can follow him on Twitter at @henrydruschel.