There are, admittedly, a lot of publicly accessible sources for PITCHf/x data. Brooks Baseball has been an indispensable site for anyone doing PITCHf/x-based research; it features both important data corrections and hand-coded pitch types that are far more accurate than MLBAM's neural-network-based classifications. Baseball Savant is newer, and while it lacks Brooks's manual pitch classifications, it has a very sophisticated and useful search tool. Jeff Zimmerman's site BaseballHeatMaps offers a downloadable file of the data for importing into the program of your choice. Carson Sievert's PitchRx package for the statistical analysis program R also downloads the data. Lastly, Mike Fast's original Perl scripts, themselves modified from Joseph Adler's Baseball Hacks, are still available (and functional with some slight modification).
A common factor across all of these is a focus on the PITCHf/x data fields themselves — any extra information, such as plate appearance outcome or game situation, isn't included directly in the pitch file produced. Sometimes it comes in supplementary data files; this is better than nothing, but behind the scenes here at Beyond the Box Score there's been a desire for a data source that unifies the granular pitch data with things like plate appearance outcome. I've created something to fill that role: a Python-based PITCHf/x scraper that includes much (if not all) of the extra information. I'll take you through what it is, where to find it, and how to use it.
For those who are interested but aren't aware of it, all PITCHf/x data remains publicly accessible through the MLB Gameday directory. Here's an example: the directory for the 2009 Game 163 between the Twins and Tigers. There's a lot of information there; in fact, at the very beginning of this year, highly detailed Statcast data appeared in the same place, though it rapidly disappeared. For now we'll focus on the subdirectory called "Inning". Within it are a number of XML files: one for each inning, one that combines all innings, one that only gives (less than useful) hit information, and one for scoring plays. Everything but the "hit" file contains either the complete set or a subset of the game's pitch data.
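The Gameday directory layout follows a predictable date-based pattern, so building the path to any game's combined-innings file is mostly string formatting. A minimal sketch (the helper function is hypothetical, and the game ID shown is my reconstruction of the Twins-Tigers tiebreaker's ID, so treat both as illustrative):

```python
# Sketch of the gd2.mlb.com URL pattern for a game's inning_all.xml file.
# inning_all_url() is a hypothetical helper, not part of the scraper itself.
BASE = "http://gd2.mlb.com/components/game/mlb"

def inning_all_url(year, month, day, gid):
    """Build the path to a game's combined-innings XML file."""
    return ("{0}/year_{1:04d}/month_{2:02d}/day_{3:02d}/{4}/inning/"
            "inning_all.xml").format(BASE, year, month, day, gid)

url = inning_all_url(2009, 10, 6, "gid_2009_10_06_detmlb_minmlb_1")
```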
My parser uses the "inning_all.xml" file to minimize the number of files it needs to open. Iterating through the file, my Python script grabs all the pitch-specific data available; I'd go through the fields in detail, but Mike Fast already has, so there's no need to repeat the information. Where my scraper differs from the existing ones is in the extra information it includes with each pitch. My scraper reports the following:
- Retrosheet-style game ID
- Flags to indicate whether the game is from spring training, the regular season, or playoffs (and which playoff round, if applicable)
- The MLB game ID number (especially useful if they ever restore Statcast data to the publicly accessible directory)
- Game location data
- Batter/Pitcher ID data
- Game situation data (current inning, score, count, Retrosheet-style base state pre- and post-plate appearance)
- Pitch outcome sequence up to that point in the plate appearance (e.g., BBSBFX to indicate ball-ball-strike-ball-foul-in play)
- A flag to designate if the pitch is the last pitch in the plate appearance
- Retrosheet-style event code
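The core of all this is a pass over inning_all.xml that walks inning, at-bat, and pitch elements, carrying the surrounding context down to each pitch row. The author's script uses BeautifulSoup; the sketch below substitutes the standard library's ElementTree to stay dependency-free, and the attribute names reflect the Gameday schema as I understand it. (The real pitch-sequence field distinguishes fouls and balls in play; here I just concatenate the coarse B/S/X type codes.)

```python
# Minimal sketch of extracting per-pitch data plus at-bat context from
# inning_all.xml. ElementTree stands in for BeautifulSoup here; the sample
# XML and attribute names follow the Gameday format as best I recall it.
import xml.etree.ElementTree as ET

sample = """\
<game>
  <inning num="1">
    <top>
      <atbat num="1" batter="400121" pitcher="433587" event="Strikeout">
        <pitch des="Called Strike" type="S" start_speed="92.4"/>
        <pitch des="Ball" type="B" start_speed="85.1"/>
      </atbat>
    </top>
  </inning>
</game>
"""

rows = []
root = ET.fromstring(sample)
for inning in root.iter("inning"):
    for atbat in inning.iter("atbat"):
        seq = ""  # pitch outcome sequence so far within the plate appearance
        for pitch in atbat.iter("pitch"):
            seq += pitch.get("type")
            rows.append({
                "inning": inning.get("num"),
                "batter": atbat.get("batter"),
                "pitcher": atbat.get("pitcher"),
                "event": atbat.get("event"),
                "des": pitch.get("des"),
                "seq": seq,
                "start_speed": pitch.get("start_speed"),
            })
```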
The data is output into a file called "pitch_table.csv", with one line per pitch. It also produces an "atbat" CSV file with some summary information for each plate appearance in the pitch file.
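Writing the one-line-per-pitch output is plain CSV work; a sketch with Python's csv module, using an abbreviated and illustrative column list rather than the script's actual schema:

```python
# Illustrative sketch of the one-line-per-pitch CSV output. The column list
# is abbreviated, not the script's full schema.
import csv
import io

fieldnames = ["game_id", "inning", "des", "type", "start_speed", "final_pitch"]
rows = [{"game_id": "MIN200910060", "inning": 1, "des": "Called Strike",
         "type": "S", "start_speed": 92.4, "final_pitch": 0}]

buf = io.StringIO()  # stand-in for open("pitch_table.csv", "w", newline="")
writer = csv.DictWriter(buf, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)
```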
As you might be able to tell from the fairly extensive game metadata included with each line, I intended the output to be imported into a relational database (I use MySQL, personally). However, the output format is generic enough that you should be able to work with it in whatever program you'd usually use for CSV files. Probably not Excel, though, unless you subset the file first; the complete collection of PITCHf/x data (back to 2008) takes up right around 2 GB when scraped and parsed this way. At the end of this post I'll include links to the SQL I used to create the tables that store this data for me.
To use this scraper, you have to have Python installed, of course, as well as a few extras. I developed the script using Python 3.4; I can't say whether it would work with a different version, though I doubt it would run on 2.7 without some modification. I also developed it on Windows, but I don't know of any reason it wouldn't work just as well on another OS. Still, the rest of this will be written from a Windows-centric perspective.
Python can be installed from the Python.org download page. Once it's in place, you'll probably want to add the Python directory paths to your environment variables; this lets you run a Python script from any directory, not just the Python bin directory. After that, there are up to two extra Python modules to install: BeautifulSoup and lxml. BeautifulSoup is absolutely required, since it interprets all the XML; lxml is an XML parser, so if you already have one installed or prefer another, it should work just as well (I deliberately didn't specify a parser in the code, so BeautifulSoup should pick the best available one, or at least an acceptable one).
BeautifulSoup can be installed by opening a command prompt (anywhere, if you modified the environment variables; within the Python bin folder if you didn't) and executing either "pip install beautifulsoup4" or "easy_install beautifulsoup4". You may need to upgrade pip first; if so, the command is "pip install --upgrade pip".
lxml, as mentioned above, is an XML parser for Python, and in theory it could be replaced by a different one; I used lxml during development and know that it works. However, it can definitely be harder to install than BeautifulSoup. I had success using the unofficial pre-built version from Christoph Gohlke at UC Irvine, which can be found here. This avoids having to compile the binaries yourself, and (I believe) the need for a C++ compiler; if you do still need one, there's a Microsoft Visual Studio installer with what you'll need. Google around for instructions on installing lxml.
It's important to get the WHL file from Gohlke's site that matches your Python installation, in both version number and processor architecture. For example, if you installed the 64-bit version of Python 3.4, you need lxml-3.4.4-cp34-none-win_amd64.whl, and no other will work. Installation is done with pip: open a command prompt in the location where you downloaded the WHL file and run "pip install lxml-3.4.4-cp34-none-win_amd64.whl" (or whatever version of the file you actually have).
Once you have lxml installed (for me, that was the most challenging part of the process), you should be good to go. Open a command prompt in the directory where you saved the scraper and run "python pfx_parser_csv.py". The script will ask whether you want to choose the starting and ending dates (separately); for whichever one(s) you choose, it'll prompt you for the year, month, and day. If you don't, the default is to start on January 1, 2008 and go until the day before you run the script. The output files, pitch_table.csv and atbat_table.csv, will be produced in the directory you ran the script from.
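That default window, January 1, 2008 through yesterday, is simple to express with the datetime module; a sketch (the function name and the example "today" are mine, not the script's):

```python
# Sketch of the script's default date window: January 1, 2008 through the
# day before the script is run. default_date_range() is a hypothetical helper.
from datetime import date, timedelta

def default_date_range(today=None):
    """Return (start, end) for the default scrape window."""
    if today is None:
        today = date.today()
    return date(2008, 1, 1), today - timedelta(days=1)

start, end = default_date_range(date(2015, 7, 15))
```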
If you dig through the code, you'll find some peculiarities (ideally fewer of them the further from this post's original date you grab the script, as I intend to clean it up when I have time). There's some leftover code that, if un-commented, will ask whether you want to continue the script from the last date found. It was intended to read an existing output file, find the last date completed in that file, and start from there; I never finished making that work. There may also be some leftover code that inserted the scraped data directly into a MySQL table, but I abandoned that in favor of the CSV output (I just might not have deleted the code). Lastly, on line 81 (as of this writing) there's a commented-out line that inserts a one-second delay between games. If this is your first exposure to web scraping, you might not be aware that it's generally considered courteous to build something like that into your scraper, especially if the data source is a small(ish) organization or company, so their servers aren't bombarded with requests quite so rapidly. I don't use the feature in this particular script, for two reasons. First, I assume that MLBAM deals with so much traffic every second of every day that running this script goes unnoticed; second, the individual games take long enough to parse on my machine that there's already an unavoidable second or two between opening game files. Still, if you find that the script repeatedly crashes due to connection errors but your internet connection is otherwise fine, you might consider turning on that delay (by deleting the pound sign at the start of the line).
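For the curious, the unfinished "resume" feature described above amounts to scanning the existing output for the most recent game date. A sketch of the idea, assuming the first CSV column is a Retrosheet-style game ID (home team code, then YYYYMMDD, then a game number); that layout is my guess at a workable one, not necessarily the script's:

```python
# Sketch of the unfinished resume-from-last-date idea: find the most recent
# game date already present in an existing pitch_table.csv. The column layout
# assumed here is illustrative.
import csv
import io

existing = io.StringIO(  # stand-in for an existing pitch_table.csv
    "game_id,inning\n"
    "DET200910030,9\n"
    "MIN200910060,1\n"
)

last_date = None
for row in csv.DictReader(existing):
    ymd = row["game_id"][3:11]  # slice YYYYMMDD out of the game ID
    if last_date is None or ymd > last_date:
        last_date = ymd
```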
If your script does crash (which in my experience is reasonably likely if you're gathering more than a year's worth of data at once), the best course of action is to resume the collection rather than start over: check the command prompt window to see which day was in progress, then restart the script with that day as the start date. If it's an off-season day, you're good; if it's in-season, there will probably be some duplicates in the output files. These can be removed after data collection is complete by running the pitch_dup_remove.py and atbat_dup_remove.py scripts, found here and here respectively. The scripts run fast, but as I understand it they load the entire output file into memory, so you need enough memory available to accommodate however large the pitch table ends up. My system runs Windows 10 with 6 GB of memory in total, and it couldn't handle it; I had to switch to my Linux installation to get it to work. If you have trouble, you could try re-running the original script in smaller date chunks. In my experience, collecting all seasons of data at once takes a few days of non-stop scraping.
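I haven't inspected the linked dedup scripts, but a lighter-on-memory variant of the same idea is to stream the file line by line, keeping only a set of lines already seen rather than the whole parsed table. (The set still grows with the number of unique rows, so this is a mitigation, not a cure.) A minimal sketch:

```python
# Streaming exact-duplicate removal: hold a set of seen lines instead of the
# entire parsed table. Hypothetical sketch, not one of the linked scripts.
import io

raw = io.StringIO(  # stand-in for the scraped pitch_table.csv
    "game_id,inning\n"
    "MIN200910060,1\n"
    "MIN200910060,1\n"
    "DET200910030,9\n"
)
deduped = io.StringIO()  # stand-in for the cleaned output file

seen = set()
for line in raw:
    if line not in seen:
        seen.add(line)
        deduped.write(line)
```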
I think that just about covers everything. I'd like to thank the people who helped me get the code up and running, both in its current form and when it was still a Perl script, as well as all the BtBS staff who requested data that I couldn't get without writing this script. I'll check back in frequently with this post, so leave a comment if you have any questions/trouble with this and I'll try to help. I hope you find this useful!
John Choiniere is a researcher and occasional contributor at Beyond the Box Score. You can follow him on Twitter at @johnchoiniere.