There was a recent fanpost here at Beyond The Box Score requesting assistance forming some basic baseball-related queries with SQL. In that thread, a number of other BTBers, both writers and readers alike, reported interest in conducting a sort of beginner's tutorial as well. So it would seem like today is the perfect opportunity to end the long hiatus of Beyond The Box Score's immensely popular DIY database-building series, "Saberizing a Mac."
If you are not familiar with SQL, but would like to begin setting up a baseball database, then I suggest you start from the beginning, because this will all seem like gibberish to you otherwise:
Part 1: Installing the Baseball Databank Database
After you're set up, I would then refer you to Colin Wyers's excellent beginner's tutorial now featured at The Hardball Times. This should get you acquainted with some basic query-language. Have fun with it, explore strange criteria, make up your own stats, etc.
If this all seems far too intimidating to you, know this: Up until 4-5 years ago I had absolutely zero experience with any form of data programing. The last computer class I took involved booting up a floppy disc of Oregon Trail on MS-DOS circa 1989. I've had zero formal training in any of this, but I've since built a fairly radical database after reading one article by Colin Wyers at the now-defunct Statistically Speaking website sometime back in 2007-ish. So if I can do it, so can you. No excuses.
Since the html in the comments section in the Fanpost isn't exactly code-friendly, I'm instead going to respond to one of the more common code requests I've gotten, adding wOBA to your database.
I would first begin by downloading Fangraphs "GUTS" as a csv file, and then importing it into your SQL program as a separate table. SQL programs may differ on how to import a file as a table, but for my SQLpro (downloaded per the Saberizing a Mac series) it works like this:
- Select File/Import.
- Find the GUTS file. Click "Open"
- Click "New". Name the file "GUTS." Click "Import."
It's extremely helpful then to index the "season" field right away. This is a nice habit to get into as it speeds up your queries significantly, thereby allowing you to get back to that incredibly successful social life as soon as possible.
CREATE INDEX year_idx ON GUTS (season);
We can then join the 'Batting' and 'GUTS' tables to create the wOBA formula of our own choosing. This is the code for Fangraphs' current wOBA formula:
SELECT b.playerID, b.yearID
, CONCAT(m.namefirst," ",m.namelast) as Name
, SUM(b.ab)+SUM(b.bb)+SUM(b.hbp)+COALESCE(SUM(b.sf),0)+COALESCE(SUM(b.sh),0) as PA
, ROUND(((SUM(b.bb)-COALESCE(SUM(b.ibb),0))*g.wbb + COALESCE(SUM(b.hbp),0)*g.whbp +
(SUM(b.h)-SUM(b.2b)-SUM(b.3b)-SUM(b.hr))*g.w1b +SUM(b.2b)*g.w2b +SUM(b.3b)*g.w3B +SUM(b.hr)*g.wHR)/
(SUM(b.ab)+(SUM(b.bb)-COALESCE(SUM(b.ibb),0))+COALESCE(SUM(b.hbp),0)+COALESCE(SUM(b.sf),0)),3) as wOBA
FROM batting b, master m, GUTS g
WHERE b.playerID = m.playerID AND b.yearID = g.season
AND b.yearID = 2012
GROUP BY playerID, yearID
HAVING PA > 499
ORDER BY wOBA DESC
That should return you the top 50 leaders in wOBA for the 2012 season with at least 500 PA's. It should line up exactly with Fangraphs leaderboards here. If you notice a discrepancy somewhere, let us know and we'll try and smoke it out.
My top ten looks like this:
If your query returned something different, let us know in the comments and we'll try to help you figure out what the issue is.
Remember, I started out from nothing, so I have an enormous debt to pay forward, so don't be afraid to ask the most embarrassing, basic of questions!
More from the Saberizing a Mac series:
Thanks to Fangraphs, Colin Wyers, and The Hardball Times.
Follow @JDGentile on twitter.