clock menu more-arrow no yes mobile

Filed under:

Saberizing a Mac #8: Creating a WAR database

Continuing Beyond The Box Score's series on creating your own sabermetric database.

Big, fat WAR data.
Big, fat WAR data.
Darren Carroll

Baseball-Reference offers a wonderfully easy-to-use Play Index Tool on their site for the incredibly low cost of two dollars a day, six dollars a month, or just 36$ a year. With the PI, you can query just about any season or career info you want, and within seconds your strangest, most ridiculous sense of baseball curiosity is satisfied with the swift click of a button. Even the most obsessive, geekiest of baseball fans can walk away feeling fulfilled.

But the key words here are "just about." Because for those of us on the fringe of complete insanity, where an obsession with baseball statistics blends with pure psychosis, the PI isn't always enough. We want our own. And fortunately, the good people at Baseball-Reference offer to the public all their WAR data (including fielding, positional value, leverage index, opposition strength, et al) for absolutely free in a convenient downloadable .csv format.

You don't need to be world-class data programmer to provide yourself with this awesome, mind-blowing sense of power, either. All it takes is a little moxie and a little help from the Beyond the Box Score Community and you can be on your way to having your own massive rWAR, fWAR, and WARP database in no time.

Getting Started

Firstly, however, you need to start at the beginning and download SQLpro for your mac (or SQLyog for your PC), per the instructions here. Adding the Lahman database is probably a good idea, also, as it will provide you with more basic, albeit essential, information on players and teams you will likely need later on.

Secondly, then, you need to import the Baseball-Reference csv file, which can be found here. Scroll down to the middle of the page or so until you see this link:

New WAR daily, full history: Batters, Pitchers

The links provided here are functioning, but it is important to know where to find this link should you want to update your database during the 2013 season (which, face it-- you will).

The WAR data is separated into two files: one for pitchers, one for position players. When you click on each link, an extremely large csv table will appear. You will need to CTRL+A and CTRL+C that and paste it into an excel-type program.

If you know how to import a csv you should skip the next section.

The CSV file

Because I've never figured out how to properly paste csv files into mac's excel-imitator "Numbers," I use OpenOffice ( a free program) instead. You can download it free here.

  • Once you have your spreadsheet open, select "Edit / Paste Special".
  • Select "unformatted text", then "OK".
  • On the "Text Import" page you will need to select 'Separated by' and then check the box marked 'comma'.
  • Then click "OK."

Once the file is opened and it appears to be formatted properly, choose 'Save as' with '.csv' as your filetype.

Importing into SQL

Now return to your SQLpro program and open up your Lahaman database.

  • Select 'File/Import'
  • Find your file, click 'Open'

A window should pop up titled "CSV Import Field Mapping". You'll want your WAR table to be its own table and NOT import it into another table, so click the button marked "New" on the far right of the header tab. Name it whatever you want, naturally, but I name mine WAR_b and WAR_p.

Now, this is the part where we should probably consult the community here, because I'm certainly not well-versed enough in the ways of data storage to take you through the next step confidently. But I can tell you what I do.

There should be three columns visible to you at this point. One labeled 'CSV fields' and one labeled 'Table fields'. These should match one another precisely. The third column is unnamed, and that contins what type of data each row is storing. For instance, for names of players and teams that field should be a "VARCHAR" with parenthesis denoting the maximum number of characters that field can hold (I believe the default setting is 255). For integers like "year_id" I've chosen "INT(4)". For anything dealing with decimals, like "runs_offense" or "runs_defense", I just choose "DOUBLE."

This is probably a lazy and inefficient means of storing your data, but perhaps someone with more expertise can help you with this if you'd like.

Once you've finished , click 'IMPORT' and your new WAR table should appear in your database in a few moments.

Running a WAR Query

First thing first: index the daylights out of your new tables:

CREATE INDEX player_idx ON WAR_b (player_ID);

CREATE INDEX year_idx ON WAR_b (year_ID):

CREATE INDEX player_idx ON WAR_p (player_ID);

CREATE INDEX year_idx ON WAR_p (year_ID):

You should even index Team_ID and whatever else you feel may be helpful for the types of queries you want to write.

Then you can begin having fun. Maybe start with something simple:

SELECT player_ID, name_common, year_ID, SUM(WAR) as WAR
WHERE year_ID = 2012
GROUP BY player_ID, year_ID

That should return you the top 25 WAR earners in the 2012 season from both leagues, with a certain spectacular rookie season at the top of the list:

player_ID name_common year_ID WAR
troutmi01 Mike Trout 2012 10.7
canoro01 Robinson Cano 2012 8.2
poseybu01 Buster Posey 2012 7.2
mccutan01 Andrew McCutchen 2012 7
cabremi01 Miguel Cabrera 2012 6.9
braunry02 Ryan Braun 2012 6.8
molinya01 Yadier Molina 2012 6.7
wrighda03 David Wright 2012 6.7
beltrad01 Adrian Beltre 2012 6.7
gordoal01 Alex Gordon 2012 6.2
bournmi01 Michael Bourn 2012 6
headlch01 Chase Headley 2012 6
vottojo01 Joey Votto 2012 5.6
zobribe01 Ben Zobrist 2012 5.5
hunteto01 Torii Hunter 2012 5.5
heywaja01 Jason Heyward 2012 5.5
pradoma01 Martin Prado 2012 5.4
ramirar01 Aramis Ramirez 2012 5.4
stantmi03 Giancarlo Stanton 2012 5.4
jacksau01 Austin Jackson 2012 5.2
harpebr03 Bryce Harper 2012 5
spande01 Denard Span 2012 4.8
pedrodu01 Dustin Pedroia 2012 4.7
cabreme01 Melky Cabrera 2012 4.7
encared01 Edwin Encarnacion 2012 4.6

You can then do the same thing with Fangraphs and Baseball Prospectus by downloading information from their leaderboards. Be sure to also check out Adam Darowski's Hall of Stats download and Seamheads as well.

As always, don't be afraid to ask even the most embarrassing questions, and someone should respond to you as soon as they can in the comments. And please, if you know of an easier or more efficient way of importing a csv file and converting it to a proper SQL table, please let us know. I'd appreciate it.

More from the Saberizing a Mac series:

Sky's SQL Basics

#2 Baseball Databank

#3 Win Shares

#4 Pitch f/x

#5 wOBA

#6 FIP, FIP-, ERA-

#7 year-to-year data and r

#8 WAR

. . .

Thanks to Baseball-Reference and Sean Lahman.

James Gentile writes about baseball at Beyond the Box Score and The Hardball Times. You can follow him on twitter @JDGentile.