Current public Statcast tracking data

There are 37 games of Statcast tracking data available on the MLB Gameday server. Here's how to find them and how to understand them.

Charles LeClaire-USA TODAY Sports

UPDATE -- Sometime in the first week after I posted these two articles, MLB removed the Statcast data from their website. Please contact me directly if you'd like access to what I have.

At the Sloan Sports Conference in March 2014, Major League Baseball announced the existence of a new data collection system that promised more information than was ever available before about the game. The new system would use multiple technologies to track both the ball and the players, giving users incredibly detailed information about everyone and everything's position and velocity multiple times per second. It was said that this system, called Statcast, would produce seven terabytes (that is, ~7000 gigabytes) of data per game, and that it would be ready to go league-wide at the start of the 2015 season.

Unfortunately, Statcast hasn't really lived up to the internet's expectations, in terms of either data availability or data reliability. Unlike what happened with PITCHF/x, there didn't appear to be any publicly accessible source of the tracking data. For the first ten or so games of the season, there was a new file in the PITCHF/x directory that contained batted ball exit velocity and launch angle, but it soon disappeared. It was later discovered that these same data were available from a different MLB source, and working from these files Daren Willman has put together an excellent resource at his Baseball Savant site for querying the available data. However, it was shown this summer both here at BtBS and at FanGraphs that 1) not all batted balls had data included in that source, and 2) the batted balls that did include the new data were not randomly sampled from the total set of batted balls, but were more likely to be well-struck balls. BtBS's own Neil Weinberg found that for a reasonably random subset of data the wOBA on batted balls with data was .438, versus .364 on balls without the data.

Now, to be sure, MLB itself has been making tremendous use of the Statcast data. During broadcasts all season long we've seen on-screen data about player speed, acceleration, and reaction time, as well as exit velocity and launch angles, and this has been terrifically interesting. However, much of the information presented thus far has been in a "processed" form - MLB decided that the top speed at which a center fielder runs towards a fly ball is enough for the general public for now. Access to raw data with which analysts and hobbyists could create their own metrics is effectively non-existent to this point.

I say effectively non-existent because there is a small set of Statcast tracking data accessible from the same file system as PITCHF/x data. There are 37 games from the first few days of the season. Anyone who goes to look can access player tracking, ball tracking, and pitch information as collected by the Statcast tracking system.  Here's what those data look like.

Statcast Files

Hidden Statcast treasures are found within these folders

The existing Statcast data are found within the 'play-builder' folder at gd2.mlb.com. If you go to http://gd2.mlb.com/components/game/play-builder, you're presented with a list of about 100 sub-folders, all of which are labeled numerically. Each of these contains Statcast tracking data; however, only 39 of them correspond to regular-season games. These are labeled with the same six-digit game id number you'd find anywhere in MLB's data structure for these games.  It's not clear to me what the other folders are from; the files within them that contain dates indicate that they're from late March or early April, so it seems likely to me that they're from some sort of testing project.  MLB games are numbers 413649 through 413693.

Within an individual game folder, there are either three or four subfolders: Combined, Event, Live, and Refined. Inexplicably, "Event" is missing from some games.  Each folder can contain anywhere from just a few to literally 200,000 JSON files. The "Combined" folder looks to contain data assembled from the other folders, and I'd speculate that it has something to do with the public presentation of the information. There are only a few plays' worth of data in this folder, and from here on out I'll be ignoring it.

A quick note: all files mentioned below are in JSON, or JavaScript Object Notation, format. It's a way to hierarchically store data in a text format.
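
For anyone working from a local copy of a game folder, a quick file count per subfolder is a useful first sanity check. The sketch below is only an illustration: it assumes the game has been mirrored to disk with the four subfolder names spelled as above and with files ending in ".json", neither of which I can confirm matches the server exactly, and the function name is hypothetical.

```python
from pathlib import Path

# Subfolder names as described in the text; the exact spelling and case on
# the server are an assumption here.
SUBFOLDERS = ("Combined", "Event", "Live", "Refined")


def count_json_files(game_dir):
    """Return {subfolder: number of .json files} for one local game folder."""
    game_dir = Path(game_dir)
    counts = {}
    for name in SUBFOLDERS:
        folder = game_dir / name
        if folder.is_dir():
            # Assumes files carry a ".json" extension in the local mirror.
            counts[name] = sum(1 for _ in folder.glob("*.json"))
        else:
            # None marks a missing folder (e.g. "Event" for some games),
            # which is different from a folder that exists but is empty.
            counts[name] = None
    return counts
```

Returning None for a missing folder keeps games without an "Event" folder distinguishable from games where the folder exists but holds nothing.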

The "Event" folder contains files that are sort of akin to each play's metadata. There are usually about 10 files per recorded play, and in total anywhere from a few hundred to nearly a thousand per game; the file names appear to contain timestamps, and it would make sense if they corresponded to something like "time of data collection", but I haven't been able to reconcile them.  Event files, where they exist, describe what happened during a play and who was on the field at the time. Positions 1-9 are the fielding lineup, position 10 is the batter, 11-13 are runners (and the field is "NULL" if the base is empty), and 14-19 are the umpires. The file further tells you who was specifically involved in the play, and what the situation was (balls, strikes, outs, inning). Here's an example.

{"gpk":413649,"guid":"06200151-5a82-47f0-bec1-d59de44e2a83","lineup":[{"pos":1,"id":433587},{"pos":2,"id":572287},{"pos":3,"id":489149},{"pos":4,"id":429664},{"pos":5,"id":572122},{"pos":6,"id":543543},{"pos":7,"id":554429},{"pos":8,"id":457706},{"pos":9,"id":452234},{"pos":10,"id":459964},{"pos":11,"id":405395},{"pos":12,"id":null},{"pos":13,"id":null}],"umpires":[{"pos":14,"id":427520},{"pos":15,"id":427292},{"pos":16,"id":427286},{"pos":17,"id":483569},{"pos":18,"id":null},{"pos":19,"id":null}],"event":[{"pos":10,"details":[{"pos":4,"typ":"f_assist","id":429664},{"pos":6,"typ":"f_putout","id":543543},{"pos":1,"typ":"p_ground_out","id":433587}],"typ":"force_out","id":459964}],"runners":[],"sit":{"outs":2,"balls":1,"top_inning":1,"strikes":0,"inning":4}}

This tells you the game id number (413649), the play id number (guid), the player and umpire id numbers (e.g., the pitcher's ID is 433587), and then some event information. In this case, it's saying there was a batting event ("pos":10), there was a fielding assist by the second baseman ("pos":4,"typ":"f_assist"), a putout by the shortstop ("pos":6,"typ":"f_putout"), a groundout credited to the current pitcher ("pos":1,"typ":"p_ground_out"), and the play in general was a force play ("typ":"force_out"). The file also describes the game situation in a weird mix of pre- and post-play information. Following this play there are two outs, and this occurred on a 1-0 count in the top of the 4th.  This particular game is the Mariners' opener; this was a 4th-inning fielder's choice hit by Matt Joyce where Pujols was forced out at second, FC 4-6.  It's not clear to me why there are so many files per play, because the information doesn't appear to change between any of them.
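
To make that reading concrete, here is a minimal sketch of pulling the same facts out of an Event file with Python's json module. The sample string is an abbreviated copy of the example above, the function names are hypothetical, and treating "top_inning":1 as "top of the inning" is my interpretation based on the play described, not something the files document.

```python
import json

# Abbreviated copy of the Event example above (most lineup entries omitted).
EVENT_JSON = """{"gpk":413649,"guid":"06200151-5a82-47f0-bec1-d59de44e2a83",
"lineup":[{"pos":1,"id":433587},{"pos":4,"id":429664},{"pos":6,"id":543543},
{"pos":10,"id":459964},{"pos":11,"id":405395},{"pos":12,"id":null}],
"event":[{"pos":10,"details":[{"pos":4,"typ":"f_assist","id":429664},
{"pos":6,"typ":"f_putout","id":543543},
{"pos":1,"typ":"p_ground_out","id":433587}],"typ":"force_out","id":459964}],
"sit":{"outs":2,"balls":1,"top_inning":1,"strikes":0,"inning":4}}"""


def role_for(pos):
    """Map a position number to the role described in the text."""
    if 1 <= pos <= 9:
        return "fielder"
    if pos == 10:
        return "batter"
    if 11 <= pos <= 13:
        return "runner"
    if 14 <= pos <= 19:
        return "umpire"
    return "unknown"


def summarize_event(raw):
    """Collect the game id, occupied positions, credits, and situation."""
    play = json.loads(raw)
    # Skip empty slots such as an unoccupied base (id is null).
    on_field = {p["pos"]: p["id"] for p in play["lineup"] if p["id"] is not None}
    credits = [
        (d["typ"], d["pos"], d["id"])
        for ev in play["event"]
        for d in ev["details"]
    ]
    sit = play["sit"]
    return {
        "game": play["gpk"],
        "guid": play["guid"],
        "on_field": on_field,
        "play_types": [ev["typ"] for ev in play["event"]],
        "credits": credits,
        # Assumption: top_inning == 1 means the top half of the inning.
        "half": "top" if sit["top_inning"] == 1 else "bottom",
        "inning": sit["inning"],
        "count": (sit["balls"], sit["strikes"]),
        "outs": sit["outs"],
    }
```

Calling summarize_event(EVENT_JSON) returns a dictionary whose "credits" list holds the assist, putout, and groundout entries in file order, and role_for(10) gives "batter".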

The "Live" folder is much larger. It contains JSON files with XY positioning of all people recognized by the system, as well as some further information that varies in type file-to-file. These files contain a few timestamps (on different scales), x-y positioning for everyone on the field, x-y-z positioning for the ball (though it's usually NULL), and data from one or more "pkgs". Type 0 contains pitch speed (out to an absurd 13 decimal places) and not much else; Type 1 has more detailed pitch info - some PITCHF/x-style stuff, 3D velocity, spin, approach angles. Type 2 is the most common and the longest by far (usually spread over multiple files), and is position and trajectory data for the ball over the course of the whole play. Types 4, 9, and 10 are all part of measuring the flight of a ball in the air (whether it's a pitch or a batted ball), and there seems to be a bit of overlap between them. Types 7 and 8 describe the precise circumstances and results of the plate appearance. Types 3, 5, and 6 seem to not exist, at least in this data set. There are about 4000 live files per game.

Here's a type-0 example:

{"ver":6,"ts":1428462483739,"fts":"2015-04-08T03:08:03.739","gpk":413664,"gm":2,"guid":"00000000-0000-0000-0000-000000000000","trgts":[{"typ":1,"x":-1.478,"y":-6.057,"id":2},{"typ":1,"x":-21.236,"y":325.427,"id":8},{"typ":2,"x":-0.59,"y":-8.179,"id":14},{"typ":1,"x":1.862,"y":-1.278,"id":10},{"typ":3,"x":97.573,"y":67.806,"id":18},{"typ":1,"x":-0.54,"y":60.04,"id":1},{"typ":2,"x":23.745,"y":169.371,"id":16},{"typ":2,"x":-83.932,"y":83.217,"id":17},{"typ":0,"x":81.196,"y":7.794,"id":0},{"typ":1,"x":23.213,"y":153.989,"id":6},{"typ":2,"x":101.888,"y":100.841,"id":15},{"typ":3,"x":-86.111,"y":54.543,"id":19},{"typ":1,"x":-41.379,"y":87.434,"id":5},{"typ":1,"x":80.243,"y":101.701,"id":3},{"typ":1,"x":-153.055,"y":236.459,"id":7},{"typ":1,"x":131.696,"y":269.93,"id":9},{"typ":1,"x":77.885,"y":150.502,"id":4},{"typ":4,"x":-2.606,"y":-0.512,"z":1.411}],"pkgs":[{"typ":0,"data":{"PitchReleaseData":{"MeasurementID":"1317f589-dce8-4d81-9690-0c14b6b9f3ca","TimeCodeOffset":0.0179291,"ReleaseSpeed":91.6747224053967,"TimeCode":251511094,"Time":"2015-04-08T03:08:06.9106724Z"}}}]}
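
As a rough illustration of reading a file like this one, the sketch below converts "ts" to a date and pulls the release speed out of any type-0 package. It rests on two assumptions of mine: that "ts" is milliseconds since the Unix epoch in UTC (it lines up with "fts" in this example when read that way), and that rounding the speed to one decimal place loses nothing useful. The sample string is a trimmed copy of the example above and the function name is hypothetical.

```python
import json
from datetime import datetime, timedelta, timezone

# Trimmed copy of the type-0 example above (the "trgts" list is omitted).
LIVE_JSON = """{"ver":6,"ts":1428462483739,"fts":"2015-04-08T03:08:03.739",
"gpk":413664,"pkgs":[{"typ":0,"data":{"PitchReleaseData":{
"MeasurementID":"1317f589-dce8-4d81-9690-0c14b6b9f3ca",
"ReleaseSpeed":91.6747224053967}}}]}"""

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def release_speeds(raw):
    """Return (frame time, [(MeasurementID, speed rounded to 0.1)])."""
    frame = json.loads(raw)
    # Assumption: "ts" counts milliseconds since 1970-01-01 UTC.
    when = EPOCH + timedelta(milliseconds=frame["ts"])
    speeds = []
    for pkg in frame.get("pkgs", []):
        if pkg["typ"] != 0:
            continue  # only type-0 packages carry PitchReleaseData here
        data = pkg["data"]["PitchReleaseData"]
        speeds.append((data["MeasurementID"], round(data["ReleaseSpeed"], 1)))
    return when, speeds
```

The "MeasurementID" is kept alongside the speed because the same identifier shows up in the trajectory package of the next example, which suggests it can be used to tie the packages for one pitch together.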

And here's a very abbreviated type-2 example (they're usually much, much longer):

{"ver":6,"ts":1428462484006,"fts":"2015-04-08T03:08:04.006","gpk":413664,"gm":2,"guid":"00000000-0000-0000-0000-000000000000","trgts":[{"typ":1,"x":-1.51,"y":-6.09,"id":2},{"typ":1,"x":-21.301,"y":325.428,"id":8},{"typ":2,"x":-0.59,"y":-8.114,"id":14},{"typ":1,"x":1.959,"y":-0.92,"id":10},{"typ":3,"x":97.638,"y":67.806,"id":18},{"typ":1,"x":-0.54,"y":60.04,"id":1},{"typ":2,"x":23.745,"y":169.371,"id":16},{"typ":2,"x":-83.964,"y":83.185,"id":17},{"typ":0,"x":81.392,"y":7.794,"id":0},{"typ":1,"x":23.311,"y":153.956,"id":6},{"typ":2,"x":101.823,"y":100.842,"id":15},{"typ":3,"x":-86.046,"y":54.412,"id":19},{"typ":1,"x":-41.379,"y":87.434,"id":5},{"typ":1,"x":80.243,"y":101.701,"id":3},{"typ":1,"x":-152.695,"y":236.23,"id":7},{"typ":1,"x":131.794,"y":269.897,"id":9},{"typ":1,"x":77.788,"y":150.47,"id":4},{"typ":4,"x":-2.606,"y":-0.512,"z":1.411}],"pkgs":[{"typ":2,"data":{"LiveTrajectoryData":{"MeasurementID":"1317f589-dce8-4d81-9690-0c14b6b9f3ca","BallPositions":[{"BallPosition":{"TimeCodeOffset":0.0044843,"Position":{"Z":1.80596575267904,"Y":1.25173432735839,"X":-0.962897793727265},"TimeCode":251511107,"Velocity":{"Z":-8.11149569151268,"Y":-82.7209249184464,"X":-3.27094669302707},"Type":"Measured","Time":0.42}},{"BallPosition":{"TimeCodeOffset":0.0144843,"Position":{"Z":1.67296949011135,"Y":0.0336405433203741,"X":-0.994661077262214},"TimeCode":251511107,"Velocity":{"Z":-8.34550814871547,"Y":-82.6179029101653,"X":-3.22471171687226},"Type":"Measured","Time":0.43}}]}}}]}
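
A type-2 package can be flattened into one row per ball position, which is easier to work with than the nested form. This is a sketch under assumptions: the sample is the "pkgs" portion of the abbreviated example above, the function name is hypothetical, and the "speed" column is simply the magnitude of the reported velocity vector in whatever units the system uses, which the files themselves don't state.

```python
import json
import math

# The "pkgs" portion of the abbreviated type-2 example above.
TRAJECTORY_JSON = """{"pkgs":[{"typ":2,"data":{"LiveTrajectoryData":{
"MeasurementID":"1317f589-dce8-4d81-9690-0c14b6b9f3ca","BallPositions":[
{"BallPosition":{"TimeCodeOffset":0.0044843,
"Position":{"Z":1.80596575267904,"Y":1.25173432735839,"X":-0.962897793727265},
"TimeCode":251511107,
"Velocity":{"Z":-8.11149569151268,"Y":-82.7209249184464,"X":-3.27094669302707},
"Type":"Measured","Time":0.42}},
{"BallPosition":{"TimeCodeOffset":0.0144843,
"Position":{"Z":1.67296949011135,"Y":0.0336405433203741,"X":-0.994661077262214},
"TimeCode":251511107,
"Velocity":{"Z":-8.34550814871547,"Y":-82.6179029101653,"X":-3.22471171687226},
"Type":"Measured","Time":0.43}}]}}}]}"""


def ball_track(raw):
    """Flatten every type-2 package into a list of per-sample rows."""
    frame = json.loads(raw)
    rows = []
    for pkg in frame.get("pkgs", []):
        if pkg["typ"] != 2:
            continue
        traj = pkg["data"]["LiveTrajectoryData"]
        for item in traj["BallPositions"]:
            bp = item["BallPosition"]
            pos, vel = bp["Position"], bp["Velocity"]
            rows.append({
                "measurement": traj["MeasurementID"],
                "t": bp["Time"],
                "x": pos["X"], "y": pos["Y"], "z": pos["Z"],
                # Magnitude of the velocity vector; units are not documented.
                "speed": math.sqrt(vel["X"] ** 2 + vel["Y"] ** 2 + vel["Z"] ** 2),
                "kind": bp["Type"],
            })
    return rows
```

Because a single play is usually spread over multiple files, rows from several files would need to be concatenated and sorted by "t" before plotting a full trajectory.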

Lastly, there's the "Refined" folder, which is by far the largest of the four folders -- there are roughly 100,000-200,000 files per game there. These files are the extremely granular, two-dimensional tracking files for on-field people detected by the system. The timestamps within these files indicate that the data are being recorded every hundredth of a second. These files have a similar structure to the beginning of the "Live" files, giving a "typ", "x", "y", and "id" for each tracked object; however, the "typ" and "id" don't always have data associated with them, making it difficult to actually use the files to track players. Here's an example of what it looks like:

{"ver":6,"ts":1428353974371,"fts":"2015-04-06T20:59:34.371","gpk":413655,"gm":2,"guid":"0276bfe3-bb86-423e-812e-d2a4a97ffe1d","trgts":[{"typ":0,"x":2.138,"y":-6.013,"id":0},{"typ":0,"x":-0.334,"y":-7.216,"id":0},{"typ":3,"x":85.052,"y":57.659,"id":18},{"typ":1,"x":144.716,"y":248.342,"id":9},{"typ":2,"x":-98.014,"y":97.747,"id":17},{"typ":1,"x":59.663,"y":101.488,"id":3},{"typ":3,"x":-103.492,"y":77.369,"id":19},{"typ":0,"x":0,"y":0.534,"id":0},{"typ":2,"x":87.357,"y":85.62,"id":15},{"typ":1,"x":-41.825,"y":138.435,"id":6},{"typ":2,"x":31.235,"y":170.271,"id":16},{"typ":1,"x":-0.468,"y":59.864,"id":1},{"typ":1,"x":24.353,"y":153.969,"id":4},{"typ":1,"x":23.117,"y":309.208,"id":8},{"typ":1,"x":-73.193,"y":100.386,"id":5},{"typ":1,"x":-123.803,"y":271.192,"id":7}],"pkgs":[]}
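
One way to see how much of a Refined frame is usable is to separate the targets that carry an id from those that don't. The sketch below assumes that an "id" of 0 (or a missing "id") means the system did not label the target; that is my reading of the example rather than a documented rule. The sample is a shortened copy of the frame above, and the function name is hypothetical.

```python
import json

# Shortened copy of the Refined example above.
REFINED_JSON = """{"ts":1428353974371,"gpk":413655,"trgts":[
{"typ":0,"x":2.138,"y":-6.013,"id":0},
{"typ":0,"x":-0.334,"y":-7.216,"id":0},
{"typ":3,"x":85.052,"y":57.659,"id":18},
{"typ":1,"x":144.716,"y":248.342,"id":9},
{"typ":1,"x":-0.468,"y":59.864,"id":1}],"pkgs":[]}"""


def split_targets(raw):
    """Return ({id: (x, y)} for labeled targets, [(x, y)] for the rest)."""
    frame = json.loads(raw)
    labeled, unlabeled = {}, []
    for target in frame["trgts"]:
        xy = (target["x"], target["y"])
        # Assumption: id 0 or a missing id means "not identified".
        if target.get("id"):
            labeled[target["id"]] = xy
        else:
            unlabeled.append(xy)
    return labeled, unlabeled
```

Running this over every frame in a game and tallying the share of unlabeled targets would give a quick measure of how often player identity is missing.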

So, that's about it. There are a total of about 125,000 lines per game (though I haven't done the exact math on that), meaning there are about 4.5 million rows of data from these 37 games. However, that doesn't tell the whole story.

If you dig into the PITCHF/x files, you'll find that included with many (but importantly, not all) of the individual recorded pitches is a field labeled "play_guid." You may have also noticed that there's a field called "guid" in all of the file types I described above. I observed that many of the Statcast files had GUIDs that matched those found in the PITCHF/x files, which seemed too unlikely (given their length) to be random, and I was able to confirm via some of the live files that the data do correspond with each other. However, only about half of the Statcast files can be matched to a particular pitch. It's not clear at all to me where the extra data come from; a good example can be found in the first game recorded, which was Anaheim v. Seattle on the 6th of April. There's a particular Peter Bourjos plate appearance that lasted four pitches, but shows data recorded for nine pitches. There was speculation throughout the season that there were issues with the quality of the data being produced by the Statcast system; taken with the missing batted ball data from the other source, this seems like excellent confirmation that this is true.
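
The matching itself is a simple set comparison once the GUIDs have been collected from each source. The sketch below is an illustration only: the function name is hypothetical, the inputs are assumed to be plain lists of GUID strings already extracted from the PITCHF/x "play_guid" fields and the Statcast "guid" fields, and discarding the all-zero GUID seen in the Live examples as a placeholder is my assumption.

```python
# The all-zero GUID appears in the Live examples; treating it as a
# placeholder rather than a real play id is an assumption.
NULL_GUID = "00000000-0000-0000-0000-000000000000"


def match_guids(pitchfx_guids, statcast_guids):
    """Compare two collections of GUID strings and report the overlap."""
    pfx = {g.lower() for g in pitchfx_guids if g}
    sc = {g.lower() for g in statcast_guids if g and g.lower() != NULL_GUID}
    return {
        "matched": pfx & sc,          # present in both sources
        "statcast_only": sc - pfx,    # Statcast data with no known pitch
        "pitchfx_only": pfx - sc,     # pitches with no Statcast data
    }
```

The "statcast_only" set is the interesting one here, since it isolates the extra recordings, such as the additional "pitches" in the plate appearance mentioned above, that can't be tied to any pitch in PITCHF/x.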

The available data have left me with a somewhat paradoxical feeling -- it simultaneously seems like there's too much and not enough data. Having only 37 games is too small of a sample to be able to establish anything about individual players, especially since there are at most four games per team, but having nearly 200,000 data points available per game requires a good plan for data management and analysis in advance.  That's the point of writing this -- even though only a few games' worth of data are available, my hope is that by providing the public at large with a description of what's out there and a method for acquiring it, good analytical methods can be established in advance of what will hopefully be a much larger release of data next year.

In a separate article, I've detailed a script I've written in Python that downloads all the important Statcast data and organizes it in what I think is a logical way.  The article also contains a few examples of the visualizations possible with the tracking data.

----------

John Choiniere is a researcher and occasional contributor at Beyond the Box Score.
You can follow him on Twitter at @johnchoiniere.