Big, fat WAR data. - Darren Carroll
Continuing Beyond The Box Score's series on creating your own sabermetric database.
Baseball-Reference offers a wonderfully easy-to-use Play Index Tool on their site for the incredibly low cost of two dollars a day, six dollars a month, or just 36$ a year. With the PI, you can query just about any season or career info you want, and within seconds your strangest, most ridiculous sense of baseball curiosity is satisfied with the swift click of a button. Even the most obsessive, geekiest of baseball fans can walk away feeling fulfilled.
But the key words here are "just about." Because for those of us on the fringe of complete insanity, where an obsession with baseball statistics blends with pure psychosis, the PI isn't always enough. We want our own. And fortunately, the good people at Baseball-Reference offer to the public all their WAR data (including fielding, positional value, leverage index, opposition strength, et al) for absolutely free in a convenient downloadable .csv format.
You don't need to be world-class data programmer to provide yourself with this awesome, mind-blowing sense of power, either. All it takes is a little moxie and a little help from the Beyond the Box Score Community and you can be on your way to having your own massive rWAR, fWAR, and WARP database in no time.
Firstly, however, you need to start at the beginning and download SQLpro for your mac (or SQLyog for your PC), per the instructions here. Adding the Lahman database is probably a good idea, also, as it will provide you with more basic, albeit essential, information on players and teams you will likely need later on.
Secondly, then, you need to import the Baseball-Reference csv file, which can be found here. Scroll down to the middle of the page or so until you see this link:
The links provided here are functioning, but it is important to know where to find this link should you want to update your database during the 2013 season (which, face it-- you will).
The WAR data is separated into two files: one for pitchers, one for position players. When you click on each link, an extremely large csv table will appear. You will need to CTRL+A and CTRL+C that and paste it into an excel-type program.
If you know how to import a csv you should skip the next section.
The CSV file
Because I've never figured out how to properly paste csv files into mac's excel-imitator "Numbers," I use OpenOffice ( a free program) instead. You can download it free here.
- Once you have your spreadsheet open, select "Edit / Paste Special".
- Select "unformatted text", then "OK".
- On the "Text Import" page you will need to select 'Separated by' and then check the box marked 'comma'.
- Then click "OK."
Once the file is opened and it appears to be formatted properly, choose 'Save as' with '.csv' as your filetype.
Importing into SQL
Now return to your SQLpro program and open up your Lahaman database.
- Select 'File/Import'
- Find your file, click 'Open'
A window should pop up titled "CSV Import Field Mapping". You'll want your WAR table to be its own table and NOT import it into another table, so click the button marked "New" on the far right of the header tab. Name it whatever you want, naturally, but I name mine WAR_b and WAR_p.
Now, this is the part where we should probably consult the community here, because I'm certainly not well-versed enough in the ways of data storage to take you through the next step confidently. But I can tell you what I do.
There should be three columns visible to you at this point. One labeled 'CSV fields' and one labeled 'Table fields'. These should match one another precisely. The third column is unnamed, and that contins what type of data each row is storing. For instance, for names of players and teams that field should be a "VARCHAR" with parenthesis denoting the maximum number of characters that field can hold (I believe the default setting is 255). For integers like "year_id" I've chosen "INT(4)". For anything dealing with decimals, like "runs_offense" or "runs_defense", I just choose "DOUBLE."
This is probably a lazy and inefficient means of storing your data, but perhaps someone with more expertise can help you with this if you'd like.
Once you've finished , click 'IMPORT' and your new WAR table should appear in your database in a few moments.
Running a WAR Query
First thing first: index the daylights out of your new tables:
CREATE INDEX player_idx ON WAR_b (player_ID);
CREATE INDEX year_idx ON WAR_b (year_ID):
CREATE INDEX player_idx ON WAR_p (player_ID);
CREATE INDEX year_idx ON WAR_p (year_ID):
You should even index Team_ID and whatever else you feel may be helpful for the types of queries you want to write.
Then you can begin having fun. Maybe start with something simple:
SELECT player_ID, name_common, year_ID, SUM(WAR) as WAR
WHERE year_ID = 2012
GROUP BY player_ID, year_ID
ORDER BY WAR DESC
That should return you the top 25 WAR earners in the 2012 season from both leagues, with a certain spectacular rookie season at the top of the list:
You can then do the same thing with Fangraphs and Baseball Prospectus by downloading information from their leaderboards. Be sure to also check out Adam Darowski's Hall of Stats download and Seamheads as well.
As always, don't be afraid to ask even the most embarrassing questions, and someone should respond to you as soon as they can in the comments. And please, if you know of an easier or more efficient way of importing a csv file and converting it to a proper SQL table, please let us know. I'd appreciate it.
More from the Saberizing a Mac series:
. . .
Thanks to Baseball-Reference and Sean Lahman.