Learning R: Mercury is in Retrosheet

A baseball superstition has bled into R.

BBA-AS-RANGERS Max Faulkner/Fort Worth Star-Telegram/Tribune News Service via Getty Images

My friend and I have this superstition to never make fun of a player’s name. It started when we were watching her favorite team, the A’s, play against the Rangers on July 29, 2010. Taylor Teagarden stepped into the box, and we immediately started making fun of his name because it’s the least imposing moniker a baseball player could have. What didn’t help his cause was that entering the game, Teagarden was hitting just .037 on the season. As we cracked our jokes, Teagarden launched an 0-2 pitch over the wall for a two-run homer. The next time the A’s and the Rangers met, he socked another dinger. And the next day, he hit another one.

Teagarden finished the year slashing .155/.259/.338 for a 57 wRC+, but he destroyed the A’s. Teagarden had 20 plate appearances against Oakland—the most against any of his opponents that season—and he went 6-for-19 with three home runs and a double. His 223 tOPS+ against the A’s was all because we made fun of his name.

It happened again with other players, too. Two seasons ago, I thought to myself that JB Shuck sounds like an oyster restaurant, and then he went 4-for-7 with two doubles against my Giants. The last time my friend and I spoke, it was for her to let me know that the A’s broadcasters were poking fun at Rowdy Tellez’s name before he got rowdy and sent a ball over the wall.

This curse has even extended into my learning R. When I looked up when that first Teagarden home run was, I didn’t do it in my usual way of finding games I’m trying to remember. Usually, I look up game logs on Baseball Reference and scroll down to find a boxscore that sort of matches what I’m looking for. That method works for certain specific things. It wouldn’t be too hard to find a 2010 Teagarden game where he hit a home run against the A’s, but this method is limited to my memory.

Instead, I looked up Taylor Teagarden home runs on Retrosheet using R, and it only took me about four hours to get Retrosheet data into RStudio.

Getting Retrosheet files into a workable format is one of the things students of Analyzing Baseball Data with R have the most trouble with. The book lays out some very simple instructions for getting the Retrosheet files in csv format.

1) Create a retrosheet folder in the working directory

2) Download some special software tools called Chadwick

3) In R, type in source("parse_retrosheet_pbp.R")

4) Then, type in parse_retrosheet_pbp() with the season you want in the parentheses.

Do these steps, the book says, and you’ll wind up with two fine looking csvs.
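Most of what went wrong for me could have been caught before running anything. Here is a minimal sketch of a pre-flight check for the steps above. The function name preflight and its arguments are my own invention, not something from the book, and it assumes that the parsing function relies on Chadwick's cwevent command-line tool being findable on your PATH.

```r
# Hypothetical helper (not from the book): check the prerequisites
# before trying to source and run parse_retrosheet_pbp().
preflight <- function(dir = "retrosheet", script = "parse_retrosheet_pbp.R") {
  c(
    # Step 1: the folder exists relative to the current working directory.
    folder_exists = dir.exists(dir),
    # Step 3: the script you are about to source() is actually on disk.
    script_exists = file.exists(script),
    # Step 2: Sys.which() returns "" when a program is not on the PATH.
    # Assumption: cwevent is the Chadwick tool the parser needs.
    chadwick_on_path = unname(nzchar(Sys.which("cwevent")))
  )
}

# Paths are resolved against the working directory, so print it first.
print(getwd())
print(preflight())
```

Any FALSE in the result points at the step to revisit. Printing getwd() first is worth the extra line, since a wrong working directory makes both the folder and the script look like they are missing.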

There was a fair amount of user error in my struggles to get Retrosheet into R. The first mistake I made was not downloading Chadwick. The parse_retrosheet_pbp function uses the program from Chadwick to create the csvs, so it can’t work without it. However, the book only says you should download Chadwick, and my first thought was “I’m not letting a Chad with a monocle into my home.”

I also ran into some case sensitivity issues. The book tells you to lower case your folder names, and I prefer to uppercase the first letter because I’m not a monster. Then I failed to properly set the working directory.

Some things, however, weren’t my fault. That parse_retrosheet_pbp.R file was nowhere to be found in the master files uploaded by the authors on GitHub, so I had to find the code for the function elsewhere. That wasn’t so much of a problem once I realized this function I was trying to source didn’t exist on my computer. I just wish it didn’t take me 45 minutes to come to that realization.

Even after I resolved all my mistakes and got my two tidy csvs in my unzipped folder, my code wasn’t working. I double-checked and triple-checked my code for typos or mistakes, but it was flawless. I should have been returning a simple line graph that charted the home run race between Mark McGwire and Sammy Sosa, but all I was getting was an error: BAT_ID not found.

Just as I was about to give up, I inspected a data frame I was trying to call. The data1998 table was supposed to hold all the play-by-play from that season, but it held nothing. The data frame was empty. I went into my retrosheet folder and opened the all1998.csv that Chadwick so lovingly crafted for me, and I found it empty as well.
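In hindsight, a couple of one-line checks would have pointed at the empty file long before the plotting code complained. The sketch below uses a throwaway temp file and a made-up data frame instead of real Retrosheet output, and the helper name csv_is_empty is hypothetical.

```r
# Hypothetical helper: TRUE when a file is missing or has zero bytes.
csv_is_empty <- function(path) {
  size <- file.size(path)  # file.size() returns NA for a missing file
  is.na(size) || size == 0
}

# Stand-in for all1998.csv: an empty temp file, not real Retrosheet data.
empty_csv <- tempfile(fileext = ".csv")
file.create(empty_csv)
print(csv_is_empty(empty_csv))

# Stand-in for data1998: a data frame with no rows and no BAT_ID column.
plays <- data.frame()
if (nrow(plays) == 0) {
  message("No rows loaded; check the csv before blaming the plot code.")
}
if (!"BAT_ID" %in% names(plays)) {
  message("BAT_ID column is missing from the data frame.")
}
```

Checking nrow() and names() right after loading turns a confusing downstream error into an obvious upstream one.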

That seemed wrong, but what do I know? I figured a csv with every play from an entire season would be massive, so maybe the data was just, I don’t know, hidden.

After another hour or so of Googling and brute forcing the code, I finally came across this blog post where a commenter also had problems with empty csvs. Turns out, Chadwick version 0.7, which I installed, doesn’t work for this specific purpose. I downgraded to Chadwick 0.6, and everything worked like a dream.

There’s probably a complicated technical reason why Chadwick 0.7 didn’t work, but I know the real reason: I made fun of its name. I’m sorry, Chadwick. I was wrong. I should know better by now.


Kenny Kelly is the managing editor of Beyond the Box Score. You can follow him on Twitter @KennyKellyWords.