Statcast scraper and visualization examples

How to download the publicly-available Statcast data, and some quick/easy examples of what can be done with it right now.

John Choiniere

UPDATE -- Sometime in the first week after I posted these two articles, MLB removed the Statcast data from their website. Please contact me directly if you'd like access to what I have.

In a separate article today, I wrote about the existence of a small set of Statcast tracking data available to the public via the same file system as PITCHF/x data. Here, I'll describe how to get it for yourself, and I'll show a few very simple things you can do with it.

I've posted a Python-based scraper for the Statcast data to my GitHub account. It runs through the Event, Live, and Refined data files and extracts the information that I found to be relevant. Importantly, though, it doesn't look through literally every file. As I wrote in the other article, a lot of the data files in the Statcast directories don't seem to match up to anything that happened in real life. I've accounted for this by having the scraper first use the PITCHF/x files for each game to identify which Statcast files correspond to actual pitches. Doing this cuts out about half of all Statcast files. Unfortunately, it also means some data are missed -- not every PITCHF/x-recorded pitch includes the identifier I used to check whether the Statcast data are real -- but I feel the benefit of excluding probable nonsense data vastly outweighs the cost of losing a fairly small amount of signal.
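The filtering step described above might look something like the following sketch. It assumes, hypothetically, that each PITCHF/x pitch element carries an identifier attribute (called `play_guid` here) and that each Statcast file name is built from the same identifier; neither detail is spelled out in this article, and the function names are invented for illustration. It also uses Python's standard-library XML parser rather than BeautifulSoup and lxml, purely to keep the example self-contained.

```python
import xml.etree.ElementTree as ET

# Hypothetical attribute name; the identifier the real scraper checks may differ.
ID_ATTR = "play_guid"


def real_pitch_ids(pitchfx_xml):
    """Collect the identifiers of pitches recorded in a PITCHF/x game file."""
    root = ET.fromstring(pitchfx_xml)
    ids = set()
    for pitch in root.iter("pitch"):
        guid = pitch.get(ID_ATTR)
        if guid:  # pitches with a missing or empty identifier are skipped
            ids.add(guid)
    return ids


def keep_real_files(statcast_files, ids):
    """Keep only Statcast file names whose stem matches a known pitch ID."""
    kept = []
    for name in statcast_files:
        stem = name.rsplit(".", 1)[0]
        if stem in ids:
            kept.append(name)
    return kept


# Made-up sample data for illustration only.
SAMPLE_XML = """
<inning>
  <atbat>
    <pitch des="Ball" play_guid="aaa-111"/>
    <pitch des="Called Strike" play_guid=""/>
    <pitch des="Foul" play_guid="ccc-333"/>
  </atbat>
</inning>
"""
SAMPLE_FILES = ["aaa-111.xml", "bbb-222.xml", "ccc-333.xml", "ddd-444.xml"]

ids = real_pitch_ids(SAMPLE_XML)
print(keep_real_files(SAMPLE_FILES, ids))
```

In this toy example, the second pitch has no identifier, so any Statcast file that belonged to it would be dropped along with the unmatched files. That is the trade-off described above: some real data are lost in exchange for excluding files that match nothing.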

As with my PITCHF/x scraper, running the script requires both BeautifulSoup and an XML parser (I use lxml) to be installed. After that, running the script is reasonably straightforward; it will create its own file structure within whichever folder you run it from. The trickiest part will probably be keeping the script running long enough without a connection error. There are so many lines of data to record that scraping all 37 games in a row would take 6-8 days of continuous running. In practice, I was typically only able to run it for twelve hours at most before it failed due to an inability to connect to the MLB site. Unfortunately, to avoid any duplication of data I had to delete any incomplete games when the script failed, which adds significantly to the total time taken. I worked around this by creating a separate script for each game; I'll leave it to you to modify the script that way if you want.
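For anyone modifying the script, one possible alternative to a separate script per game is sketched below. This is not how my scraper works: it retries after a connection error, writes each game to a temporary file that is only renamed once the game is complete, and skips games that already finished on an earlier run. The `fetch_game` argument is a hypothetical stand-in for whatever function downloads one game's rows, and the CSV layout is likewise an assumption.

```python
import os
import tempfile
import time


def scrape_with_retry(fetch_game, game_id, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch_game(game_id), retrying on connection errors with backoff."""
    for attempt in range(retries):
        try:
            return fetch_game(game_id)
        except ConnectionError:
            if attempt == retries - 1:
                raise  # out of attempts; let the caller see the failure
            sleep(base_delay * 2 ** attempt)  # delay doubles after each failure


def scrape_games(fetch_game, game_ids, out_dir, **retry_kwargs):
    """Scrape each game into its own file, skipping games already completed."""
    os.makedirs(out_dir, exist_ok=True)
    for game_id in game_ids:
        final_path = os.path.join(out_dir, game_id + ".csv")
        if os.path.exists(final_path):
            continue  # finished on an earlier run
        rows = scrape_with_retry(fetch_game, game_id, **retry_kwargs)
        # Write to a temporary file first so a crash never leaves a
        # half-written game under the final file name.
        fd, tmp_path = tempfile.mkstemp(dir=out_dir, suffix=".part")
        with os.fdopen(fd, "w") as handle:
            for row in rows:
                handle.write(",".join(str(v) for v in row) + "\n")
        os.replace(tmp_path, final_path)
```

Because a finished game only ever appears under its final name, restarting after a failure means rerunning the same command; any leftover `.part` files can be deleted safely.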

The first example is from the Angels/Mariners game on April 6th. In the first inning, Felix Hernandez struck out Albert Pujols on seven pitches. All seven were recorded by the Statcast system; for each, the ball's location and velocity were tracked every hundredth of a second, meaning there are about 40 data points per pitch. I used the "rgl" package in R to create a 3D scatterplot of each pitch's position (using connected points, so it looks like a line). The red lines are strikes, the green lines are balls, and the blue lines are fouls. I included an approximate strike zone, which is a plate-shaped region between 1.5 and 3.5 feet off the ground. Click on the image to open a new tab/window containing a manipulatable 3D plot object.
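The figure of about 40 points per pitch follows from the sampling rate and the ball's time in flight. As a rough check in Python, assuming a release point about 55 feet from the plate and a constant speed of 90 mph (both illustrative round numbers, not values taken from the data):

```python
SAMPLE_INTERVAL_S = 0.01      # one reading every hundredth of a second
RELEASE_DISTANCE_FT = 55.0    # assumed distance from release point to plate
PITCH_SPEED_MPH = 90.0        # assumed speed, treated as constant

# Convert miles per hour to feet per second.
speed_ft_per_s = PITCH_SPEED_MPH * 5280 / 3600
flight_time_s = RELEASE_DISTANCE_FT / speed_ft_per_s
samples = flight_time_s / SAMPLE_INTERVAL_S

print(f"{flight_time_s:.2f} s in flight, about {samples:.0f} samples")
```

Under those assumptions the ball is in the air for a little over four tenths of a second, which gives a sample count in the low 40s. Slower pitches stay in the air longer and would have a few more points.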

Next is an example of batted ball data. Also on April 6th, the Red Sox played at the Phillies. In the third inning, Mookie Betts hit a home run that put the Sox up 2-0. Using the same system I described above, I created a 3D plot of the hit's trajectory. Click the following still image to access a manipulatable version of this. Also, for any Phillies fans reading this, I know center field at Citizens Bank Park isn't exactly correct in this image, but it's close enough for my purposes.
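Before plotting a batted ball, it can help to reduce the tracked samples to a couple of summary numbers. The sketch below is in Python rather than R and assumes each sample is a `(t, x, y, z)` tuple in feet, with `z` as height above the ground; that layout and the function name are assumptions for illustration, and the sample values are made up rather than taken from the Betts home run.

```python
import math


def summarize_trajectory(points):
    """Return the highest tracked height and the straight-line horizontal
    distance between the first and last tracked samples.

    points: list of (t, x, y, z) tuples, with z as height above the ground.
    """
    if not points:
        raise ValueError("no tracked samples")
    apex = max(p[3] for p in points)
    _, x0, y0, _ = points[0]
    _, x1, y1, _ = points[-1]
    distance = math.hypot(x1 - x0, y1 - y0)
    return apex, distance


# Made-up samples for illustration; not taken from the Statcast files.
sample = [
    (0.0, 0.0, 0.0, 3.0),
    (1.0, 30.0, 120.0, 60.0),
    (2.0, 60.0, 240.0, 85.0),
    (3.0, 90.0, 360.0, 40.0),
]
apex, distance = summarize_trajectory(sample)
print(f"apex {apex:.0f} ft, tracked distance {distance:.0f} ft")
```

Note that the distance here only covers the portion of the flight the system actually tracked, so it will understate the true distance whenever tracking stops before the ball lands.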

Lastly, here's an example of combined player and ball tracking. On the 7th of April, Nolan Arenado hit a double on a liner to left field during the Rockies game against the Brewers, scoring Troy Tulowitzki from second base. In the following, you can see how each fielder moved during the course of the play, as well as Tulo, Arenado, and the ball. I'll point out, though, that the system seems to lose track of the ball sometime shortly after its first bounce. As above, click for a version you can manipulate.

That's it for now. As I said in the other article, I'm posting this not because I think there's a ton to be learned from only 37 games' worth of data, but rather in the hope that people can use the scraper and these examples as a template for getting ready for what will hopefully be a much bigger release of data next year.


John Choiniere is a researcher and occasional contributor at Beyond the Box Score.
You can follow him on Twitter at @johnchoiniere.