The arrival of the 2015 MLB season was supposed to coincide with the arrival of vast amounts of new statistical information on the game we all love in the form of MLB Statcast, which will purportedly track essentially everything on the field at all times. That hasn't quite materialized yet (Statcast's on-screen presence has been non-existent, and there's been no formal release of data), but some new information has appeared in the same directories that serve as a source for public PITCHf/x data. Specifically, there now exists some, though certainly not complete, data on batted balls: distance, speed-off-bat, and both vertical and horizontal angles. (It's not clear yet whether the horizontal angle is the initial trajectory or the heading of the final data collection point.)
One of the biggest drawbacks of doing research on batted balls, especially where you want to separate by batted ball type, is the inherent reliance on stats stringers for batted ball classification. Colin Wyers wrote a terrific article for the Hardball Times in 2009 on the effect that pressbox placement has on the rate of line drives scored (versus fly balls). Harry Pavlidis also addressed the issue of stringer bias in a few articles, also at the Hardball Times. Whether the cause is the physical location of the data recorder or the stringer's own bias isn't important, though. The point is that relying on subjective classification rather than objective measurement makes any research built on it less useful. Further, the newly available data, if and when it becomes available for all batted balls, could allow researchers and advanced-stats-type people to drop the existing batted-ball profile standard (LD%, GB/FB ratio, etc.) in favor of either a continuous function that describes trajectories or a more refined set of classes that arises naturally from the data.
Since I only have the data from a small set of games, it would be completely ridiculous to try to draw any conclusions about individual players, either batters or pitchers. What I hope you won't think it's too early for, though, is a look at league-wide numbers. I've got about 500 total data points to work with, which I'm 95% sure include foul balls. Unfortunately, not all teams and parks are represented. I used R to look at the vertical launch angle and speed-off-bat of all 500-some first, then excluded about 25 that were well-separated from the others, below 60 MPH off the bat. Rather than creating a pure heat map, which would have been either very sparsely populated or had bins too large to reveal anything, I used the MASS library for kernel density estimation, which (briefly speaking) smooths the observed points into a continuous density surface instead of counting them in fixed bins. For the sake of the graphic, I also cut off the axes to show a launch angle range of -40 to 45 degrees and a speed-off-bat range of 65 to 115 MPH (this is a visual modification only, not a data modification). Here's what it looks like:
So, obviously there are two spots that stand out. I isolated the data responsible for each and ran the same process again, to see if any separation within the spots would occur. For the lower, faster balls, they stayed basically uniformly distributed around the same center. For the higher and slower spot, though, three distinct groupings became apparent:
Now, what all of this means (if anything) isn't clear yet; that's going to take a LOT more data, which MLBAM may or may not be providing to the public. My long-term hope, as I said above, is that researchers can use these data to develop more sophisticated batted ball profiles, much better park effects models (my particular area of interest), and things I can't even imagine yet. For now, though, I guess I hope the data merely exist for the public (definitely not guaranteed) so we can keep learning more about the game we all love.
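For anyone who wants to try the density step described above on their own numbers, here is a minimal sketch of the idea. It is not the code behind the graphics: the analysis in this post was done in R with the MASS library, while this sketch uses plain Python with no extra packages. The batted balls are made up by a random number generator purely so the example runs, and the function names, the grid spacing, and the rule-of-thumb bandwidth are all my own assumptions for illustration.

```python
import math
import random
import statistics


def rule_of_thumb_bandwidth(values):
    """Normal-reference rule of thumb: 1.06 * sd * n^(-1/5).

    Assumption: a common default choice, not necessarily the bandwidth
    used for the graphics in this post.
    """
    return 1.06 * statistics.stdev(values) * len(values) ** (-1 / 5)


def kde2d(points, x_grid, y_grid):
    """Evaluate a 2D Gaussian product-kernel density estimate on a grid.

    Returns one row per y_grid value; each row holds one density per
    x_grid value.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    hx = rule_of_thumb_bandwidth(xs)
    hy = rule_of_thumb_bandwidth(ys)
    norm = 1.0 / (len(points) * 2 * math.pi * hx * hy)
    density = []
    for gy in y_grid:
        row = []
        for gx in x_grid:
            total = 0.0
            for x, y in points:
                zx = (gx - x) / hx
                zy = (gy - y) / hy
                total += math.exp(-0.5 * (zx * zx + zy * zy))
            row.append(norm * total)
        density.append(row)
    return density


def make_synthetic_batted_balls(n, seed=0):
    """Made-up (launch_angle_deg, speed_mph) pairs. NOT real Statcast data."""
    rng = random.Random(seed)
    return [(rng.gauss(10, 25), rng.gauss(88, 14)) for _ in range(n)]


MIN_SPEED_MPH = 60  # drop balls below 60 MPH off the bat, as in the post

balls = make_synthetic_batted_balls(500)
kept = [(angle, speed) for angle, speed in balls if speed >= MIN_SPEED_MPH]

# Evaluation grid matching the plotted window: -40 to 45 degrees, 65 to 115 MPH.
# All retained points still feed the estimate; the grid only limits what is shown.
angle_grid = [-40 + 2.5 * i for i in range(35)]
speed_grid = [65 + 2.5 * i for i in range(21)]
density = kde2d(kept, angle_grid, speed_grid)

# Locate the densest grid cell, i.e. the brightest spot on a density plot.
peak_value, peak_angle, peak_speed = max(
    (density[j][i], angle_grid[i], speed_grid[j])
    for j in range(len(speed_grid))
    for i in range(len(angle_grid))
)
print(f"densest cell: {peak_angle} degrees, {peak_speed} MPH")
```

Because every retained point contributes a small Gaussian bump to every grid cell, the surface stays smooth even with only a few hundred points, which is the advantage over fixed bins. Isolating one spot and rerunning the process, as I did above, would amount to filtering `kept` down to the points near that spot and calling `kde2d` again.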
John Choiniere is a researcher and featured (occasional) writer at Beyond the Box Score. You can follow him on Twitter at @johnchoiniere.