Hi all. With Sky being "hired by the Rays*" and all, I thought I would try to continue BtB's series on equipping yourself with best and geekiest tools to analyze baseball. If you've missed the previous 4 installments, and you can find them all here, we took a look at how to set up an SQL database, upload 100+ years of baseball data via the Baseball Databank, add data to the Baseball Databank, and write queries to analyze those stats. Hopefully, those who are interested in continuing this series are up to speed. If not, you an of course as questions in the comments section and I'm sure one of BtB's excellent commenters will be able to help.
Today, I wanted to take a look at one of the newer innovations in baseball analysis... Pitch f/x. Implemented sometime in 2007, Sport Vision's Pitch f/x cameras record information on each pitch in the majors, including vertical and horizontal position, velocity, initial acceleration, vertical and horizontal movement, spin angle and many other useful bits of information.
This system has opened up a whole new avenue of baseball analysis. We are now able to see the nastiness of Neftali Feliz in graphical form and the sometimes hilarious innacuracy of umpires. Anaylist's like Josh Kalk, who are well refined in the craft, have gone on to work for major league teams. And BtB's own Harry Pavlidis is not far behind him. It's even reaching major league pitchers! Brian Bannister, through extensive use of Pitch f/x data, has transformed himself into a quality major league pitcher.
Okay, enough posturing. I, for one, would like to get my hands on the data and see what I can do. Follow to the jump for more info.
The first step in creating a Pitch f/x database is to dowload the data. If your interested in simply looking at a single games worth, Dan Brooks' site offers some great tools to that; however, if you are looking to set up a database in which you can look at every and any pitchers aggregate data, the process becomes a little harder.
One such source is avaible for free here. BtB's Jeff Zimmerman and his brother have taken the time to download all of the Pitch f/x files into a single SQL dump, and they update it daily. While this is a great resource to have, and I have used it more than a couple of times, several components are missing (namely pitch type) and you have to re-download the file each day for up to date usage.
To avoid doing that, and to get more thorough pitch data, you have to set up Perl Scripts to automatically download the data. Unfortunately, most people are like me and have no idea how to manipulate and run Perl Scripts to do that for you.
The most comprehensive tutorial I found on the matter was written by Mike Fast. However, he wrote it just under 2 years ago, and some of the information may be outdated. Indeed, when I tried to implement his methods, I encountered a slew of problems. So, I thought that we could walk through his tututorial here and try to solve it on our own.
The first step is downloading the data. From Mike:
Whether or not you want to buy the book, the Perl scripts he wrote for the book can be downloaded from the examples section of O’Reilly’s website. Download and uncompress the baseball_hacks_code.zip file. It contains all the scripts from the book, divided by chapter and hack number
That sounds pretty easy so far. Here is a screenshot of what the downloaded file will look like, once you are able to uncompress it and open it:
Next step from Mike:
The first script of interest for XML downloading is hack_28_spider.pl, found in Chapter 3. This script can be used after only a few minor modifications, as follows:
To download the 2007 season starting with April 2, change Line 35 from
$start = timelocal(0,0,0,20,6,105);
to
$start = timelocal(0,0,0,2,3,107);
So, I'm pretty sure that you want to change the first line to (0,0,0,2,4, 107), that must have been a typo on Mike's part. This will start downloading the data from the 2nd of April in 2007. Then:
Similarly, you need to change the end time. Uncomment Line 40 and comment Line 41:
$now = timelocal(0,0,0,$mday - 1,$mon,$year);
#$now = timelocal(0,0,0,3,10,105);
This is where I run into trouble. Being a complete and udder idiot when it comes to computer programming, I have no idea what "comment" and "uncomment" mean. This is the step where you specify the end date from which you want to download the data, so it is obviously very important. Assuming that someone will be able to fill me in on this later, let's proceed to the next step.
Mike:
Another change is to correct for the fact that MLB now puts player information in an XML file rather than a TXT file.
Change the players.txt in Lines 95, 96, 101, and 102 to players.xml:
from
if($gamehtml =~ m/<a href=\"players\.txt\"/ ) {
to
$plyrurl = "$dayurl/$game/players.txt";
$response = $browser->get($plyrurl);
die "Couldn't get $plyrurl: ", $response->status_line, "\n"
unless $response->is_success;
$plyrhtml = $response->content;
open PLYRS, ">$gamedir/players.txt"
or die "could not open file $gamedir/players.txt: $|\n";
if($gamehtml =~ m/<a href=\"players\.xml\"/ ) {
$plyrurl = "$dayurl/$game/players.xml";
$response = $browser->get($plyrurl);
die "Couldn't get $plyrurl: ", $response->status_line, "\n"
unless $response->is_success;
$plyrhtml = $response->content;
open PLYRS, ">$gamedir/players.xml"
or die "could not open file $gamedir/players.xml: $|\n";UPDATE:The spider needs to be changed to download the game.xml file. Since the game.xml file is in the same directory as the boxscore.xml file, you can duplicate that section of the code and change "boxscore" to "game", like so:
if($gamehtml =~ m/<a>get($gameurl);
die "Couldn’t get $gameurl: ", $response->status_line, "\n"
unless $response->is_success;
$infohtml = $response->content;
open GAME, ">$gamedir/game.xml"
or die "could not open file $gamedir/game.xml: $|\n";
print GAME $infohtml;
close GAME;
} else {
print "warning: no xml game file for $game\n";
}
Sounds simple enough, you should all be able to do this. If it isn't clear, this is on the same text file that you are editing the start and end dates in, which looks something like this:
The picture is for reference only, and doesn't depict the actual changes you should make. As long as you followed the first two steps shown above, it should be pretty simple.
After you've changed the txt files to xml files, the next step is to download Perl and run the script. From Mike:
Before you can run it, though, you need to have Perl installed on your computer. You can download the Perl binaries and package manager for ActivePerl here.
This is pretty simple. Once you click on the link, you can download the file free off of their website. Just click on the "download" link, skip where it asks you to provide your name (or fill it in, whatever) and click on whatever your operating system is. Active Perl takes about 2 minutes to download, then you can click on it and it will start the intilation process. If that doesn't work, then you are probably selecting the wrong opperating system.
Once it installs, it will show up on your desktop. Just leave it there for now.
Next step:
Once you get Perl installed you should be ready to go. I made a separate directory to hold my game data. If you’re running from Windows, just open a command prompt window, cd to your game data directory, and then run the hack_28_spider.pl script.
Again, this is where I run into problems. I don't know what a directory is, or how it affects the process. Furthermore, I don't know how to "cd to your game data directory" and run the hack_28_spider.pl script (this was the one we changed right above). If anyone can help me or other people who might be having confusion as well, that would be awesome.
Okay, so once you have done that, you have to download the data to your SQL database. From Mike:
The next steps are setting up a database and adapting the hack_28_parser.pl script to input the data into the database.
Here is a link to my PBP database structure. This is a MySQL database. If you are already familiar with MySQL, then this will give you a head start.
UPDATE: Mike has an "updated" structure that includes other goodies!!
Here is my new database structure for 2008 with these fields added to the pitch table
For those using MAMP and Sequel Pro, you may have to create the tables manually. This isn't a big problem, but it's a bit tedious. To do so, you open Sequel Pro, and go to "Database" - "Create new database". I labeled mine Pitchfx. Then you create 5 tables, by pressing the + button under the table section. You should label them "atbats", "games", "pitches", "players" and "umpires".
Then, click on the first one and go to "structure". Then you can click on the + button to add columns to each table. The columns you should add are detailed by Mike in the link in the last quote. So, for the atbats table you would create 15 columns, starting with "ab_id". Then you would creat the next 14 shown.
Do this for each different table, and you have a nice a simple database structure to house all of your data. It will look like this:
The next step from Mike:
Also, here is a link to the code from my XML-to-database parser script. It’s still a work in progress, so please don’t mind the mess inside, but it works if you want to use it in its current state. I’ve fixed it up a bit from an earlier version I posted, cleaning up some of the subroutine calls. (Note: the copy now posted is yet a newer version that also processes umpires and fielding locations for balls in play.)
UPDATE: You can download the new database parser script to use these fields. I have an additional script to update the pitches table with the ball-strike count at each pitch.
Running the XML-to-database parser script will require the Perl DBI::mysql package to be installed. If you don’t have that already, open the Perl Package Manager (under the Active Perl program group if you installed Active Perl). Under the View menu, select All Packages, and look for DBI and DBD-mysql. If they aren’t listed as installed, click on them to select them, and then go to the Action menu and Install them.
That XML-to-database parser script is the key! This is what transfers the the data you download from the MLB.com website to Sequel. He revealed later that it was updated later with some information provided by Harry Pavilidis, so it looks like it should be updated. If Harry wants to provide his most recent script, that would be great as well.
I'm not sure about the second paragrpah though. I (We) already have both downloaded Active Perl, and we already have MAMP and Sequel up and running. I assume you need to connect the two somehow.
Mike says that you have to open the Perl Package Manager, yet I can't seem to find that on my computer anywhere. Is this something that you need to download seperately? Again, if someone can help me here, that would be awesome.
The next step provided in Mike's tuturiol is how to download and use MySQL. Since most of you here are using Mac's, I'll refer you to other installments in our series, where Sky shows how to download and use Sequel Pro with MAMP on the Mac.
Consider this a laymen's thought process when reading Mike's tutorial. Hopefully, more computer savy guys like Dan and Harry can help answer my questions and any others that other readers have. Or if anyone else has anymore information on the matter or can point to a different tutorial, awesome!
*"Put down"