

The early flaws of Statcast data

Statcast is amazing, but it's not ready for detailed public analysis.

Robert Mayer-USA TODAY Sports

It was about 18 months ago that baseball fans got their first look at Statcast, at the 2014 Sloan Sports Analytics Conference, and the excitement was immediate. Not much was shown -- only a couple of videos, no more than 5 minutes long in total -- but that was enough to send viewers dreaming. It seemed that the player tracking system would revolutionize the way fans engaged with baseball, ranging from deep analysis to the immediate experience of watching a game.

Unsurprisingly, analysts were not isolated from this optimism, given the obvious applications of the new system. BtBS's own Bryan Grosnick described Statcast as "a prospect tearing up the back fields in spring training... raw, but easy to dream on." Ben Lindbergh, writing at Baseball Prospectus, pointed out that "the numbers in the Heyward video... are the real thing," quoting a VP at MLB Advanced Media confirming that the figures shown were actually calculated from Statcast data. Jonah Keri wrote for Grantland that "[Statcast] will allow fans, analysts, and all 30 teams to gain precise information that was previously out of reach."

That Jonah Keri quote comes from an interview with MLB Advanced Media President Bob Bowman, in which he said, "The goal is to put the product out this year [2014], then get to all 30 parks, then release the data in unvarnished form in 2015." That's not exactly what's happened. Currently, only batted ball data is being released on a regular basis, but the analytic public has still enthusiastically engaged with what is available. Articles and leaderboards featuring batted ball velocity are bouncing around the internet constantly, and efforts to predict breakouts or collapses based on who is over- or under-performing their contact authority are increasingly common.

There have been some nagging doubts, however. In May, I looked at the Statcast data that was available, in all its small-sample, uneven-rollout glory, and found that it wasn't communicating much meaningful information. The season and the system have progressed since then, but the data have remained incomplete. About 30 percent of balls in play are without associated Statcast velocity data and are presumably not being tracked.

For the most part, those issues have, unsurprisingly, been ignored, since Statcast is the shiny new thing, and 70 percent of batted balls is still a substantial sample. But esteemed Managing Editor of BtBS Neil Weinberg tweeted the following on Saturday, regarding a sample of batted balls he grabbed from recent weeks:

It's worth going back and reading the tweets and replies, because there's a lot of interesting discussion, including this from Dan Brooks (of BrooksBaseball fame):

As soon as I saw Neil's tweet, I was hooked. Tony Blengino just published an article at FanGraphs, in which he used a sample of Statcast data to document similar inconsistencies and compared the data to previous HITf/x data (which was available only to teams). The aim of this article is to a) replicate Neil's findings with full-season data from every team, rather than a limited sample from one team; b) confirm Dan Brooks's guess, if possible; and c) look for any other biases or patterns in what is getting tracked and what isn't. My hope is that these kinds of studies will help the baseball public know exactly what we're getting when using Statcast data.

All the data in this article come from the invaluable Baseball Savant. I pulled every ball in play for every team from the beginning of the season through August 2, over 84,000 events. Of these, 70 percent were recorded by Statcast, and the wOBA of those batted balls is .383. The wOBA for non-recorded batted balls is .285, meaning the full-season gap is even larger than what Neil observed in his limited sample. Clearly, Statcast's unreported 30 percent is not random.
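The tracked/untracked split is simple to compute once each ball in play carries a flag for whether Statcast recorded it. A minimal sketch, using a handful of invented events and simplified per-event wOBA weights (real wOBA uses seasonal linear weights and a slightly different denominator):

```python
# Each event: (statcast_velocity_or_None, wOBA weight of the outcome).
# Values are illustrative, not real Baseball Savant rows.
events = [
    (102.5, 1.242),  # tracked, home run
    (95.0, 0.877),   # tracked, double
    (88.3, 0.0),     # tracked, out
    (None, 0.0),     # untracked, out
    (None, 0.691),   # untracked, single
    (91.2, 0.691),   # tracked, single
]

tracked = [w for v, w in events if v is not None]
untracked = [w for v, w in events if v is None]

tracking_rate = len(tracked) / len(events)       # share of balls with a reading
woba_tracked = sum(tracked) / len(tracked)       # wOBA on recorded balls
woba_untracked = sum(untracked) / len(untracked) # wOBA on unrecorded balls
```

On the full-season data, this kind of split produces the .383 versus .285 gap described above.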

The first thing I did was look at the rate of Statcast tracking across batted ball types, both because that's interesting in itself and to try to confirm the "cleanly hit balls" theory. In this chart (and each subsequent chart), the horizontal black line is the average for the whole sample.

Line drives, the batted ball type with the most authority behind it, lead with a 78.2 percent tracking rate. They're followed fairly closely by fly balls (77.1 percent), with ground balls (66.7 percent) further back. Popups, though, bring up the rear at an astonishingly low 39.4 percent, and the perils of simply reporting a statistic like average velocity for a hitter are immediately apparent. Players with high popup rates will look better than they are, as more than half of popups (average Statcast velocity of 74.4 mph) go unrecorded, while a much higher proportion of line drives (92.6 mph), fly balls (90.0 mph), and ground balls (85.8 mph) are tracked.
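To see the size of that selection bias, consider a quick sketch: the velocities and tracking rates below are the sample-wide figures above, but the batted-ball mix (the counts) is a hypothetical hitter, invented for illustration.

```python
# type: (count, avg exit velocity mph, Statcast tracking rate)
types = {
    "LD": (100, 92.6, 0.782),
    "FB": (150, 90.0, 0.771),
    "GB": (200, 85.8, 0.667),
    "PU": (50,  74.4, 0.394),
}

# True average over every ball this hitter put in play:
true_avg = (sum(n * v for n, v, _ in types.values())
            / sum(n for n, _, _ in types.values()))

# Observed average over only the balls Statcast tracked; because popups
# are undersampled, this creeps upward.
observed_avg = (sum(n * r * v for n, v, r in types.values())
                / sum(n * r for n, _, r in types.values()))
```

For this hitter the observed average comes out almost a full mph above the true one, purely because the soft popups disappear from the sample.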

While this is a real problem, it doesn't explain all the Statcast issues. Popups are the smallest of the four groups, and when they're removed from the sample, the proportion of tracked balls rises only 2.2 percentage points. The wOBA gap does fall substantially, with the Statcast wOBA at .397 with popups removed and the non-Statcast wOBA at .328, but a major difference remains. It's also not clear if popups are being missed frequently because of the generally low quality of contact behind them, or if there's something innate about their trajectory that is hard for Statcast.

Next, I looked from the pitcher's perspective. The following chart again shows the rate of Statcast tracking, this time across the pitch types thrown more than 1,000 times on the season thus far.

Overall, there's very little variation, with curveballs at the top differing from splitters at the bottom by about 4 percentage points, which would seem to suggest there's not much meaning here. But a 4-percentage-point difference over a few thousand pitches isn't nothing, and if you squint, you can imagine a bit of a pattern. The pitch types with the highest rates are all offspeed or breaking pitches; maybe velocity has some impact on tracking rates.

I ran a logit regression with pitch velocity as the independent variable and whether the batted ball was tracked by Statcast as the dependent variable. Unsurprisingly in a sample of over 80,000, pitch velocity was highly significant, with a z-score of -4. The coefficient was negative, indicating that a 1-mph increase in velocity corresponded to a 0.54 percent decrease in the odds of the batted ball having a Statcast velocity figure, or about 0.1 percentage points of direct probability. While pitch velocity might be statistically significant, it isn't particularly meaningful. It seems Statcast's problems with tracking balls come after contact, not before.
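Those two headline numbers, the 0.54 percent drop in the odds and the roughly 0.1-percentage-point drop in direct probability, are linked by the standard logit marginal-effect formula, dp/dx = β · p · (1 - p). A quick check of the arithmetic, using the sample's roughly 70 percent tracking rate as the baseline probability:

```python
import math

# The quoted odds effect pins down the logit coefficient directly:
odds_ratio_per_mph = 1 - 0.0054        # odds multiply by this per +1 mph
beta = math.log(odds_ratio_per_mph)    # logit coefficient, roughly -0.0054

# Marginal effect on tracking probability at the ~70% baseline:
p = 0.70
marginal = beta * p * (1 - p)          # change in p per +1 mph, ~ -0.0011
```

The marginal effect works out to about -0.11 percentage points per mph, consistent with the "about 0.1 percent" figure from the regression.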

So far, this article has established that Statcast is a work in progress, but I wanted to know if progress was actually being made. I separated the season into buckets of ten games (with a little flex on either side of the All-Star Break), and looked at the rate of Statcast tracking in each.

There has been some definite improvement from the beginning of the season, when the rate was 58.0 percent over the first 10 games and 64.5 percent over the first 30, but there has been little change since May. I also wanted to look at the wOBA differential between tracked and untracked batted balls over the course of the season, thinking that even if Statcast isn't picking up a higher rate of batted balls than in April, maybe the bias in its selection process was decreasing. The following chart plots wOBA over the season, with the wOBA of tracked balls in blue and of untracked balls in green.
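The ten-game bucketing itself is straightforward: assign each ball in play to a bucket by its team's game number, then compute a tracking rate per bucket. A sketch with invented events (the field names are hypothetical; the real rows came from Baseball Savant):

```python
from collections import defaultdict

# (team game number, was the ball tracked by Statcast?)
events = [
    (1, False), (3, True), (7, True), (9, False),     # games 1-10
    (12, True), (15, True), (18, True), (20, False),  # games 11-20
]

buckets = defaultdict(list)
for game, tracked in events:
    buckets[(game - 1) // 10].append(tracked)  # bucket 0 = games 1-10, etc.

# Tracking rate within each ten-game bucket:
rates = {b: sum(flags) / len(flags) for b, flags in sorted(buckets.items())}
```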

Instead, the wOBA gap has fluctuated randomly over the season. The largest differential did come in the first 10 days of the season, but each of the last four periods has been worse than average. Not only is Statcast in need of improvements, it looks like midseason tweaks haven't been made, or if they have been, they aren't having a discernible effect.

After all this, there's still not much that can be definitively stated about Statcast. It's not working at full capacity, and the data it does provide are a biased snapshot of better-than-average batting outcomes. Different batted ball types are tracked at different rates, but whether that's due to their underlying characteristics or the quality of contact that leads to those types isn't clear. There's relatively little variation in tracking rates among different pitches, either when considering velocity or pitch type, and the system doesn't seem to be making substantial midseason improvements.

It's a bit frustrating, because what I'm most interested in is whether it truly is better contact that leads to a higher likelihood of being tracked, but the best data for quality of contact is... batted ball velocity, only available for the vast majority of balls in play through Statcast. I say vast majority, however, because there is one other source for batted ball velocity -- the ESPN Home Run Tracker, which provides a variety of data points for every home run, including velocity.

Using the Home Run Tracker means the only batted balls being considered are home runs. With over 3,000, this is still a decent sample and much better than the alternative sample size of 0. It is obviously a biased sample, however, with no GBs or PUs and far to the right on the velocity spectrum. The Statcast tracking rate for home runs is high as a result, at 78.3 percent.  Because of this bias, if there isn't a clear trend in this dataset, that shouldn't be interpreted as evidence against the existence of such a trend among all batted balls. The following chart shows rates of Statcast tracking for different buckets of velocity, as reported by the Home Run Tracker.

Overall, there's nothing resembling a dramatic difference -- the lowest rate, coming in the greater-than-110-mph bucket, is separated from the highest, coming in the 95 to 100 mph bucket, by 4.5 percentage points. It appears as if balls in the middle of the range might be easier to track than those at the extremes, but those extreme buckets are also the smallest, with 67 home runs in the under-95 and 211 in the over-110, compared to 641, 1,288, and 807 in the 95-100, 100-105, and 105-110 buckets respectively. To visualize that distribution, here's a pair of histograms. The darker one is the count of all home runs tracked by Statcast, and the lighter one is the count of all home runs.
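The chart's buckets are simple cutpoints on the Home Run Tracker velocity. A sketch of the bucketing function (the real buckets held 67, 641, 1,288, 807, and 211 home runs, from slowest to fastest):

```python
def bucket(v):
    """Assign a Home Run Tracker velocity (mph) to the chart's buckets."""
    edges = [95, 100, 105, 110]
    labels = ["<95", "95-100", "100-105", "105-110", ">110"]
    for edge, label in zip(edges, labels):
        if v < edge:
            return label
    return labels[-1]  # everything 110 mph and up
```

Counting tracked and total home runs within each label then yields the per-bucket tracking rates plotted above.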

There might be a difference in tracking rate between the center and the edges, but eyeballing either of these charts isn't enough to confirm that. I ran another logit regression, using Home Run Tracker's velocity as the independent variable, and it was entirely insignificant in predicting Statcast tracking (z-score of .5). I also ran a logit using the distance from the mean Home Run Tracker velocity as the independent variable to look for evidence that Statcast was worse at tracking balls at the extremes, but that was even less significant (z-score of .2). Unfortunately, the Home Run Tracker data doesn't allow for any statements about Statcast's efficacy at tracking balls of certain velocities. As a last-ditch attempt to arrive at some conclusion, I checked for a meaningful difference in Statcast tracking rates among home runs of different elevations, and again found nothing, with a final logit giving a z-score of .5.
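The second regression just swaps the regressor: instead of raw velocity, each home run's absolute distance from the sample mean, so that both the hardest-hit and the softest home runs score high and a "worse at the extremes" effect would show up as a negative coefficient. A sketch of the feature construction, with invented velocities standing in for Home Run Tracker readings:

```python
# Hypothetical Home Run Tracker velocities (mph) for a few home runs.
velocities = [96.0, 101.5, 103.0, 104.5, 108.0, 111.0]

mean_v = sum(velocities) / len(velocities)
# The regressor for the "extremes" logit: distance from the mean.
distance_from_mean = [abs(v - mean_v) for v in velocities]
```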

This analysis may have raised more questions than it answered. It's now obvious that Statcast is recording a biased sample of balls in play, but why it's doing so is not. There are few other patterns to be found, with about 7 of every 10 balls in play being recorded regardless of pitch velocity or type. Whatever is behind the bias isn't going away with time either, as the gap between recorded and unrecorded wOBA has been mostly consistent.

It's easy to look at this and be somewhat frustrated, given the lofty rhetoric that was tossed around when Statcast was first unveiled, but it's important to remember exactly what's being discussed, and just how complicated a system this is. Statcast is still picking up the majority of balls in play, and there's no reason to doubt the accuracy of the measurements it's giving for those balls, which is an amazing, amazing thing.

The frequent comparison is to 2007 PITCHf/x -- cool but unreliable -- and while that seems accurate, it's worth keeping in mind what PITCHf/x became, and the degree to which it changed baseball analysis. Is the data from Statcast ready for rigorous, earthshaking studies and analysis? No, probably not; as this article makes clear, it still needs major improvements. Any study which makes major use of Statcast data should be treated as exploratory and in need of replication as soon as complete data is released. But Statcast is already communicating things that have never been communicated before. Be excited about its current, incomplete form, and excited about what it may become when it gets those improvements.

. . .

Henry Druschel is a Contributor at Beyond the Box Score. You can follow him on Twitter at @henrydruschel.