Learning R: Diving back in

After a week away, all this code looks like gobbledygook.

MLB: JUN 29 Indians at Orioles Photo by Mark Goldman/Icon Sportswire via Getty Images

I’ve never really been fond of the expression “like riding a bicycle.” It implies that cycling isn’t a skill to be honed. Riding a bike is a simple task, but it still requires conditioning if you want to do it recreationally and practice if you want to do it competitively (or if you want to be a show-off and ride with no hands). A bike was my main form of transportation throughout my twenties, and there were some stretches of my life where I’d ride over 15 miles a day. I haven’t ridden my bike in 2020, and if I got on today, my legs would feel like Jell-O, my lungs would burn, and I would hate every minute of it.

Getting back on the bike after eight or nine months would still be a better experience than going back to R after a week off.

My body might not be accustomed to the rigors of cycling, but my brain still remembers how to do it. You use your legs to turn the pedals and your arms to point where you want to go. Neither my brain nor my body was ready to get back into RStudio. I couldn’t remember the syntax for basic mutate functions, nor could I remember where I had saved things in my directory. Any muscle memory I had developed for R’s logical operators was just gone.

This wasn’t helped by the fact that the last time I studied I hardly took any notes. I went from highlighting every line like Alexis in Schitt’s Creek to highlighting one sentence a page. Here’s an example of something I highlighted: “The spread() function helps us display the results in a wide rather than long format.” No annotation, no notes, no nothing. Do you know what that sentence means? I sure don’t.

To make matters worse, the solutions for the exercises in Chapter 6 of Analyzing Baseball Data with R aren’t included in the book’s GitHub repository. So when I got stuck, I had two options: move on or bang my head against a wall until I figured something out.

I wish I could say that I gave up. I would have saved myself a lot of time, and maybe I could have spent my holiday weekend not trying to see who drew the most pickoffs in 2016. However, I love banging my head against a wall, and right now, celebrating America feels even grosser than it usually does. I was more than happy whiling away my Independence Day with Retrosheet.

After hours and hours of attempts with nothing to show but error messages, I figured it out. I think. Like I said, the solutions aren’t online, so I had no way to check my work.

As mentioned, one of the exercises in Chapter 6 asks:

Identify the baserunners who, in the 2016 season, drew the highest number of pickoff attempts when standing at first base with second base unoccupied.

Sounds easy enough. The game files from Retrosheet include pitch-by-pitch data including pickoff attempts by both the pitcher and catcher. It’s just a matter of finding a way to isolate events when there’s a runner on first and no runner on second, counting the number of pickoffs, and grouping by baserunners.

I was working from a data frame the book has you create, but this should work from any Retrosheet data frame created with the parse.retrosheet.pbp2() function.

The first thing I did (which worked) was create a new column, “pickoffs”, that was simply the pitch sequence with everything that wasn’t a pickoff throw removed. Retrosheet uses a one-character code for every pitch event, and gsub() replaced every code that wasn’t a pickoff with an empty string.

> pbp2016 %>%
+   mutate(pickoffs = gsub("[*.23>BCFHIKLMNOPQRSTUVXY]", "", PITCH_SEQ_TX)) -> pbp2016
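To see what that pattern actually leaves behind, here is a small sketch on made-up pitch sequences (the strings are invented for illustration, not taken from Retrosheet). Every character listed inside the square brackets is deleted, so anything not listed, such as "1" and "+", survives.

```r
# Made-up pitch sequences, not real Retrosheet rows.
pitch_seq <- c("B1CX", "11BS1>X", "CFBX", "B+1SX")

# Same character class as above: every listed code is replaced with "".
stripped <- gsub("[*.23>BCFHIKLMNOPQRSTUVXY]", "", pitch_seq)

print(stripped)
# "1" "111" "" "+1"
```

A plate appearance with no pickoff codes ends up as an empty string rather than NA, which matters for the counting step.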

Next, I had to count the number of pickoffs in each plate appearance. This was done with nchar(), which counts the number of characters in a string.

> pbp2016 %>%
+   mutate(pickoff_attempts = nchar(pickoffs)) -> pickoffs
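As a quick sanity check of the counting step, nchar() can be run on a few hypothetical strings that have already been stripped down to pickoff codes. The values here are invented for illustration.

```r
# Hypothetical strings that contain only pickoff codes.
pickoffs_only <- c("1", "111", "")

# nchar() is vectorized: one character count per string.
attempts <- nchar(pickoffs_only)

print(attempts)
# 1 3 0
```

Because nchar() counts every leftover character, the total is only a pickoff count if the gsub() step removed everything else.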

Once I had the number of pickoffs in each plate appearance, I needed to filter it down to events that matched my criteria: runner on first, no runner on second. This was done by using the filter() function to return rows that contained a string in BASE1_RUN_ID (ID of the baserunner on first) and no string in BASE2_RUN_ID (ID of the baserunner on second). Then I grouped by baserunner using group_by(), added together the pickoff attempts with summarize(), and stored the result in a new data frame, “pickoff_leaders”.

> pickoffs %>%
+   filter(BASE1_RUN_ID > '' & BASE2_RUN_ID == '') %>%
+   group_by(BASE1_RUN_ID) %>%
+   summarize(total_pickoffs = sum(pickoff_attempts)) -> pickoff_leaders
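Without a solutions file, one way to gain some confidence in the pipeline is to run it on a tiny data frame where the answer can be worked out by hand. The sketch below assumes dplyr is installed. The column names match the Retrosheet fields used above, but the runner IDs and pitch sequences are invented. It uses != "" in place of > '', which should behave the same way for non-missing IDs, and adds arrange() to sort the leaders.

```r
library(dplyr)

# Made-up rows: two hypothetical runners, five plate appearances.
toy <- data.frame(
  BASE1_RUN_ID = c("runA001", "runA001", "runB002", "", "runB002"),
  BASE2_RUN_ID = c("", "", "", "runC003", "runC003"),
  PITCH_SEQ_TX = c("1BX", "11CSX", "B1X", "CX", "11BX"),
  stringsAsFactors = FALSE
)

toy_leaders <- toy %>%
  mutate(
    # Strip everything that is not a pickoff code, then count what is left.
    pickoffs = gsub("[*.23>BCFHIKLMNOPQRSTUVXY]", "", PITCH_SEQ_TX),
    pickoff_attempts = nchar(pickoffs)
  ) %>%
  # Keep rows with a runner on first and nobody on second.
  filter(BASE1_RUN_ID != "", BASE2_RUN_ID == "") %>%
  group_by(BASE1_RUN_ID) %>%
  summarize(total_pickoffs = sum(pickoff_attempts)) %>%
  # Largest totals first.
  arrange(desc(total_pickoffs))

print(toy_leaders)
```

By hand, runA001 should total 3 (one throw plus two), and runB002 should total 1, because the plate appearance with a runner on second is filtered out.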

If I did this right, that means the runners who drew the most pickoff throws were Jonathan Villar, Charlie Blackmon, Mookie Betts, Jean Segura, and Francisco Lindor. Those names all make sense to me. They’re all fast runners who are threats to steal.

Pickoffs Drawn 2016

Baserunner Pickoffs
Jonathan Villar 236
Charlie Blackmon 203
Mookie Betts 174
Jean Segura 174
Francisco Lindor 172

One thing I realize I didn’t account for is pickoff attempts by the catcher when the throw goes to third. Retrosheet codes all pickoffs by the catcher in the same way regardless of where the throw is going, so I could either eliminate catcher backpicks entirely or filter the data down to situations where there’s a runner on first and no other runners. I figured pickoffs to third are rare enough that it doesn’t affect the data much.
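A rough sketch of how the catcher throws could be separated out follows. It rests on an assumption that should be checked against the Retrosheet pitch-code documentation: that "+" is a prefix marking the next pickoff code ("1", "2", or "3") as a throw by the catcher. The pitch sequences are made up. If the assumption holds, a "+1" would count as two characters under the original pattern, so the two alternatives below count throws to first more conservatively.

```r
# Made-up pitch sequences. ASSUMPTION: "+" means the pickoff code that
# follows it ("1", "2" or "3") was thrown by the catcher.
pitch_seq <- c("1B+1X", "B+3CX", "11SX")

# What the original character class leaves behind, counted with nchar().
original_count <- nchar(gsub("[*.23>BCFHIKLMNOPQRSTUVXY]", "", pitch_seq))

# Alternative 1: count only throws to first, whoever made them.
throws_to_first <- nchar(gsub("[^1]", "", pitch_seq))

# Alternative 2: drop catcher throws ("+" plus the digit after it) first,
# then count the remaining pitcher throws to first.
no_catcher <- gsub("\\+[123]", "", pitch_seq)
pitcher_only <- nchar(gsub("[^1]", "", no_catcher))

print(original_count)   # 3 1 2
print(throws_to_first)  # 2 0 2
print(pitcher_only)     # 1 0 2
```

Swapping either alternative into the mutate() step would be one way to see how much the leaderboard totals depend on catcher throws.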

Now that I figured out who drew the most pickoff throws in 2016, do I feel like I understand baseball better? No! Of course not! Do I feel like I understand R better? A little bit!


Kenny Kelly is the managing editor of Beyond the Box Score. You can follow him on Twitter @KennyKellyWords.