clock menu more-arrow no yes mobile

Filed under:

Saberizing a Mac #9: Retrosheet (part 1)

The next step in creating your own sabermetric database.

Darren Carroll

For those who are not aware, Retrosheet.org is a fantastic organization that (among many other wonderful things) aims to collect historical play-by-play descriptions from as many games as possible dating back into the furthest reaches of baseball history. Incredibly, this goal has been realized for the most part for seasons 1950 to 2012. And fortunately for the aspiring and ambitious saberist, Retrosheet provides the entirety of this collected data to the public for absolutely no cost.

With the Retrosheet event files in your arsenal, just about any split you can imagine becomes possible. In fact, just about any statistical baseball inquiry at all becomes possible. If you want to explore the most obscure avenues of the game, like very particular baserunning habits over the years, first-pitch outcomes, situational run expectancy, or even the greatest bunters of all-time-- then Retrosheet is your best friend.

Getting Started

If you haven't been following along with the Saberizing series then I suggest you consider starting at the beginning and first download SQLpro for your mac (or SQLyog for your PC), per the instructions here. Adding the Lahman database is probably a good idea, too, as it will provide you with more basic, albeit essential, information on players and teams that you will likely need later on.

Once your SQL is properly set up, you then need to head on over to the Retrosheet download page at Jeff Zimmerman's sublime BaseballHeatMaps.com. You'll probably want to start with something small and simple, like the "Last 10 years" database. These files are unimaginably large and queries can often take a ridiculously long time to process, so I strongly recommend beginning with any of the smaller databases while getting your feet wet.

Once you're all downloaded, proceed to step two: Throw those nice Zimmerman brothers a bone and click the 'Donate' button. They provide this wonderful service year after year entirely free of charge, so it's only fitting that we try and give back as much as we can.

Step 3: Import the Database.

Step 4: Do a serious fist pump because your world is about to change.

Coding

Retrosheet is set up very differently from the Lahman database, so it may seem confusing or even intimidating at first, but I promise you it isn't.

To get an idea of all the amazing bits of information now available to you, click on the events file, and then click 'Content.' The events file is where everything is happening. This is where you want to be.

Each row in the events file is an 'event' that has taken place during a ball game. Most of the time this is the conclusion of a plate appearance. But it also includes, stolen bases, wild pitches, balks, etc. Basically, any time there is an out or a change in base state, it deserves its own row.

The first column game_id tells you in which game the event in question occurred. Each event within a game is given an event_id. And of course pit_id tells you who was pitching, while bat_id tells you who was at the plate.

I recommend indexing those four fields right away.

CREATE INDEX game_idx ON events (game_id);

CREATE INDEX event_idx ON events (event_id);

CREATE INDEX bat_idx ON events (bat_id);

CREATE INDEX pit_idx ON events (pit_id);

Not to be confused with event_id, the event_cd field then describes what actually happened. According to Retrosheet, there are 22 unique events that can happen in a baseball game. These are labeled 2-23 in the lkup_cd_event table. If we want to query, say, Zack Greinke's career strikeout total we simply add up, or "SUM" the amount of times the event_cd = 3, which represents strikeout events.

In SQL we do this using 'SUM' and 'IF':

SELECT pit_ID

, SUM(IF(event_cd = 3,1,0)) as K
, SUM(IF(event_cd = 14,1,0)) as BB
FROM events
WHERE pit_id = 'greiz001'

GROUP BY pit_ID;

This should return as follows:

pit_ID K BB
greiz001 1345 371

Note that when you go to verify these numbers at Baseball-Reference and Fangraphs, you are including post-season figures as well. We'll address the issue of seperating regular season games and post-season games later on.

The pit_ID is your pitcher obviously. Like Lahman, each player has an ID, though Retrosheet's IDs are slightly different. Typically, it is the first four letters of the player's last name followed by his first initial, then '001'.

Since Retrosheet does not come with a year_ID field, if you want to generate data for individual seasons you will need to extract the year from the game_id using "SUBSTR."

SELECT pit_id, SUBSTR(game_id,4,4) as Year
, SUM(IF(event_cd = 3,1,0)) as K
, SUM(IF(event_cd = 14,1,0)) as BB
FROM events
WHERE pit_id = 'greiz001'

GROUP BY pit_id, Year;

The result should then be this:

pit_id Year K BB
greiz001 2004 100 23
greiz001 2005 114 53
greiz001 2006 5 1
greiz001 2007 106 31
greiz001 2008 183 55
greiz001 2009 242 51
greiz001 2010 181 54
greiz001 2011 214 49
greiz001 2012 200 54

So now that you get the basic idea, let's move on to something more interesting. After all, the joy of mastering Retrosheet is all about creating strange splits you can't find anywhere else, isn't it?

If we want to find out how many strikeouts and walks Greinke issued with two outs, we write the same query except we add in the outs_ct information:

SELECT pit_id, SUBSTR(game_id,4,4) as Year
, SUM(IF(event_cd = 3 AND outs_ct = 2,1,0)) as K
, SUM(IF(event_cd = 14 AND outs_ct = 2,1,0)) as BB
FROM events
WHERE pit_id = 'greiz001'

GROUP BY pit_id, Year;

pit_id Year K BB K_2outs BB_2outs
greiz001 2004 100 23 34 12
greiz001 2005 114 53 36 26
greiz001 2006 5 1 3 1
greiz001 2007 106 31 37 14
greiz001 2008 183 55 58 19
greiz001 2009 242 51 83 18
greiz001 2010 181 54 64 24
greiz001 2011 214 49 75 25
greiz001 2012 200 54 80 28

What about with two outs in the 3rd inning? Use inn_ct:

SELECT pit_id, SUBSTR(game_id,4,4) as Year
, SUM(IF(event_cd = 3 AND outs_ct = 2 AND inn_ct = 3,1,0)) as K
, SUM(IF(event_cd = 14 AND outs_ct = 2 AND inn_ct = 3,1,0)) as BB
FROM events
WHERE pit_id = 'greiz001'

GROUP BY pit_id, Year;

Two outs, third inning, runner on first? No problem. Each base state is assigned a number, which can be found in the lkup_cd_bases table. start_bases_cd = 1 represents a lone runner on first at the beginning of the event:

SELECT pit_id, SUBSTR(game_id,4,4) as Year
, SUM(IF(event_cd = 3 AND outs_ct = 2 AND inn_ct = 3 AND start_bases_cd = 1,1,0)) as K
, SUM(IF(event_cd = 14 AND outs_ct = 2 AND inn_ct = 3 AND start_bases_cd = 1,1,0)) as BB
FROM events
WHERE pit_id = 'greiz001'

GROUP BY pit_id, Year;

And there you have it. You've just now queried how many strikeouts Zack Greinke has had with two outs in the third inning with a runner on first in each of his last nine seasons.

Obviously you could keep splitting this up as much as you'd like, adding dozens of qualifying caveats and requirements. Retrosheet presents you with a virtually endless supply of possibilities. They don't always have to be as inane as this, obviously. It's for you to decide how to use it.

Other helpful information that is now available to you.

  • Pitch Sequence: pitch_seq_tx
  • Score: home_score_ct, away_score_ct
  • Pitcher and batter handedness: bat_hand_cd, pit_hand_cd
  • Fielder IDs and position: pos2_fld_id, pos3_fld_id, etc.
  • Basrunner IDs: base1_run_id, base2_run_id, base3_run_id
  • Battedball types: battedball_cd

Bunts, Fouls, errors, double plays, triple plays, RBI, sacrifice hits, sacrifice flys, pinch runner IDs, you name it.

It's all in there.


More from the Saberizing a Mac series:

Sky's SQL Basics

#2 Baseball Databank

#3 Win Shares

#4 Pitch f/x

#5 wOBA

#6 FIP, FIP-, ERA-

#7 year-to-year data and r

#8 WAR

#9 Retrosheet

As always, don't be afraid to ask the most embarrassing of questions. You can leave a comment here or if you'd like you can even contact me personally by email or twitter. Remember that everyone starts somewhere.

. . .

Thanks to Retrosheet.org and Jeff and Darrell Zimmerman for enabling us with this wonderful tool.

James Gentile writes about baseball at Beyond the Box Score and The Hardball Times. You can follow him on twitter @JDGentile.