Learning R

Learning R: On your marks, get set, mutate

Started coding, had a breakdown. Bon appétit.

By Kenny Kelly (@KennyKellyWords), May 24, 2020, 9:00am EDT

In The Great British Baking Show, contestants are faced with three challenges: a signature bake, a show-stopper, and the technical. The signature bake is an opportunity to show off tried and true recipes while the show-stopper allows bakers to get buckwild on some blancmange. Each asks the bakers to stretch their abilities and creativity to the breaking point.

The technical, on the other hand, isn’t concerned with what the bakers can imagine, only with what they know. Bakers are given a recipe for an obscure pastry and must do their best to replicate it. Following a recipe when you’re not even sure what the end result should look like is tricky enough, but the recipe is also as vague as possible. One step might just read: “Bake.” How long and at what temperature aren’t given; the bakers should intuitively know this.

It’s an excellent test of baking knowledge, at least when the test is fair. In newer seasons, the well of esoteric baked goods has dried up, so the dishes and recipes have become increasingly inscrutable. Occasionally, no one manages to make a passable dish.

The technical challenge’s descent from test of knowledge and instincts into arcane torture by laufabraud didn’t feel so dissimilar from working through the exercises at the end of Analyzing Baseball Data with R chapter 3. For the past six weeks, I’ve been learning R to become a better baseball analyst, and if you want to get caught up, here’s where to start. I left off last week looking at Hall of Fame pitching after a prolonged battle with Retrosheet and Chadwick.

The first few prompts were manageable challenges. “Use the geom_point() function to construct parallel one-dimensional scatterplots of WAR.Season for the different levels of BP.group.” That’s easy enough. All one needs to know is how to use the ggplot2 package to construct a plot. Even if the prompt had only said, “Construct parallel one-dimensional scatterplots…” without saying which function to use, it would have been obvious at this point.
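Here is a minimal sketch of what that prompt boils down to, assuming ggplot2 is installed. The data frame toy, its group labels, and its values are all made up for illustration; the real exercise uses the book's Hall of Fame pitching data.

```r
library(ggplot2)

# Made-up stand-in for the Hall of Fame pitching data (hypothetical values)
toy <- data.frame(
  BP.group   = c("Group A", "Group A", "Group A", "Group B", "Group B"),
  WAR.Season = c(1.2, 2.5, 3.1, 2.0, 4.4)
)

# Numeric variable on one axis, grouping variable on the other:
# each group gets its own one-dimensional strip of points
p <- ggplot(toy, aes(x = WAR.Season, y = BP.group)) +
  geom_point()

print(p)
```

Putting the grouping variable on one axis is what makes the scatterplots "parallel": every level gets its own row of points along the same WAR.Season scale.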

The exercises slowly ramped up in difficulty by nudging me less and less. In a scatterplot of all Hall of Fame pitchers comparing their WAR per season and the midpoint of their career, I needed to figure out how to add data labels to only those from the 1800s. Simple. The code needed was close to the code I broke down a couple weeks ago. Typing in

ggplot(hofpitching, aes(WAR.Season, MidYear)) +
  geom_point() +
  geom_text_repel(data = filter(hofpitching, MidYear <= 1900), aes(WAR.Season, label = X))

gave me:

The second exercise stripped away more information from the prompts, but for anyone paying attention and taking ample notes, how to tackle the challenges should have been straightforward. One prompt read, “Collect in a single data frame the season batting statistics for the great hitters Ty Cobb, Ted Williams, and Pete Rose.”

There are a few ways to go about this. The way I decided to do it involved a couple more steps and turned up more Pete Roses than the solution offered on GitHub. I used the get_birthyear() function made by the authors of ABDR to look up each great hitter’s birth year and playerID, then used bind_rows() to stack the results into one data frame. From there, I used inner_join() to combine my data frame with the Batting data from the Lahman database. All I had to do then was filter my chosen players into another data frame.

The only problem I ran into was that I had one too many Pete Roses. I forgot that Pete Rose’s son had a brief major league stint in 1997, so I had to figure out how to get Pete the Younger out of my data frame. My solution was to filter by birth date. It was an easy fix, but figuring that out made me feel like this was starting to click.
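The steps above can be sketched roughly like this, assuming dplyr is installed. To keep it self-contained, people and batting are tiny stand-ins for the Lahman tables with only the columns needed (no actual batting statistics), and lookup_player() is a hypothetical helper I made up for the sketch, not the book's get_birthyear(). The 1960 cutoff is likewise just an illustrative choice for separating father from son.

```r
library(dplyr)

# Tiny stand-ins for the Lahman player and batting tables (stat columns omitted)
people <- tibble(
  playerID  = c("cobbty01", "willite01", "rosepe01", "rosepe02"),
  nameFirst = c("Ty", "Ted", "Pete", "Pete"),
  nameLast  = c("Cobb", "Williams", "Rose", "Rose"),
  birthYear = c(1886, 1918, 1941, 1969)
)
batting <- tibble(
  playerID = c("cobbty01", "willite01", "rosepe01", "rosepe02", "ruthba01"),
  yearID   = c(1911, 1941, 1973, 1997, 1927)
)

# Hypothetical helper: look up every player matching "First Last"
lookup_player <- function(name) {
  parts <- strsplit(name, " ")[[1]]
  people %>%
    filter(nameFirst == parts[1], nameLast == parts[2]) %>%
    select(playerID, nameFirst, nameLast, birthYear)
}

# Stack the lookups; "Pete Rose" matches both father and son
hitters <- bind_rows(
  lookup_player("Ty Cobb"),
  lookup_player("Ted Williams"),
  lookup_player("Pete Rose")
)

# Attach season rows, then drop Pete the Younger by birth year
great_hitters <- hitters %>%
  inner_join(batting, by = "playerID") %>%
  filter(birthYear < 1960)

print(great_hitters)
```

Matching on first and last name is what lets the extra Pete Rose in; filtering on birth year (or on playerID directly) is one way to take him back out.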

At least that feeling lasted until the final prompts of the chapter. After using Retrosheet to get all of Mark McGwire’s and Sammy Sosa’s 1998 plate appearances into two separate data frames, I computed the number of plate appearances between home runs for each batter. Then I was asked to “Create a new data frame HR_Spacing with two variables, Player, the player name, and Spacing, the value of the spacing,” and that’s where I had a breakdown.
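The spacing calculation itself can be sketched in base R as follows. The event codes are a made-up toy sequence, not McGwire's or Sosa's actual plate appearances, and the variable names are my own; the one real detail is that Retrosheet records a home run as event code 23.

```r
# Toy sequence of Retrosheet-style event codes, one per plate appearance
# (made up for illustration; 23 is the home run code)
event_cd <- c(2, 23, 3, 2, 20, 23, 23, 14)

# Number the plate appearances 1, 2, 3, ...
pa <- seq_along(event_cd)

# Plate appearance numbers at which a home run was hit
hr_pa <- pa[event_cd == 23]

# Gaps between consecutive home runs; the leading 0 counts the
# plate appearances before the first home run as a gap too
spacings <- diff(c(0, hr_pa))

print(spacings)
```

The result is a plain numeric vector with one entry per home run, which is exactly why it does not slot straight into functions that expect data frames.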

Combining two data frames isn’t hard—I did it earlier with inner_join(). The spacing data, however, wasn’t kept in a data frame. It was just a value. I tried inner_join() and I tried bind_rows(), but neither gave me the results that I wanted. I finally caved and looked up the solution online.

HR_Spacing <- rbind(data.frame(Player = "McGwire", Spacing = mac.spacing),
                    data.frame(Player = "Sosa", Spacing = sosa.spacings))

rbind()!? What the hell is rbind()?

Immediately, I flipped to the index to find this rbind() and found ranef(), rbinom(), and read_csv() but no rbind(). It was then that I felt like a contestant on The Great British Bake Off staring down one of Prue’s oblique recipes for Æbleskiver. How was I supposed to know what that is?
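For anyone staring down the same recipe: rbind() is base R's function for stacking rows, and the trick in the solution is that each bare vector of spacings first gets wrapped in its own data frame. A minimal sketch, using made-up spacing values rather than either slugger's real numbers, and my own variable names:

```r
# Made-up spacing vectors (hypothetical values, not real 1998 data)
mac_spacings  <- c(2, 4, 1)
sosa_spacings <- c(5, 3)

# data.frame() recycles the single player name to match the vector length,
# so each vector becomes a two-column data frame
mac_df  <- data.frame(Player = "McGwire", Spacing = mac_spacings)
sosa_df <- data.frame(Player = "Sosa",    Spacing = sosa_spacings)

# rbind() stacks the two data frames row-wise; the column names must match
HR_Spacing <- rbind(mac_df, sosa_df)

print(HR_Spacing)
```

Once both pieces are data frames with the same column names, dplyr's bind_rows() would stack them as well; it was the bare vectors that were the problem.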

Kenny Kelly is the managing editor for Beyond the Box Score. You can follow him on Twitter @KennyKellyWords.
