Learning R: Rubber ducking

Sometimes, slowing down and taking a step back is necessary.

2011 Oakland Athletics v Boston Red Sox - Game Two Photo by Jim Rogash/Getty Images

Last week, my journey of learning R hit its first major snag. There have been dozens upon dozens of minor snags thus far, but the compatibility issues I ran into were the first to stop me in my tracks. To recap, Chapter 7 of Analyzing Baseball Data with R has you install the pitchRx package, which parses XML files from Baseball Savant. In the four years since the second edition was published, though, MLBAM made a change that borked the package.

Currently, the way forward is treacherous and murky. I could skip ahead to a chapter that doesn’t use Statcast data and circle back. I could downgrade my version of R to one that’s compatible with the baseballr package and do my best to continue with a different toolset. I could embark upon my own project with the tools I have at my disposal.

But I think what I’m going to do is actually take a step back and review the work from the previous chapter. I mentioned that I hadn’t been taking rigorous notes, so some of the code went over my head. It’s useful for me to look at code line by line, argument by argument and break down what is being done. That way, the code congeals in my brain instead of just leaking out.

It’s not just me. Real programmers do this line-by-line breakdown, too. The difference is that programmers do this when they encounter problems with their own code rather than when trying to understand someone else’s. Also, a rubber duck is occasionally involved.

Some programmers keep a rubber duck on their desk, so that when their code breaks, they can explain what their code is supposed to do to the rubber duck. Along the way, the programmer will stumble upon the fault in their code.

So, I’m going to take another look at how this Miguel Cabrera swing chart was constructed.

This is the code used to generate that plot:

k_zone_plot <- ggplot(cabrera_sample, aes(x = px, y = pz)) +
  geom_rect(xmin = -0.947, xmax = 0.947, ymin = 1.5,
            ymax = 3.6, fill = "lightgray", alpha = 0.01) +
  coord_equal() +
  scale_x_continuous("Horizontal location (ft.)",
                     limits = c(-2, 2)) +
  scale_y_continuous("Vertical location (ft.)",
                     limits = c(0, 5))

k_zone_plot +
  geom_point(aes(color = factor(swung))) +
  scale_color_manual("Swung", values = c("gray70", crcblue),
                     labels = c("No", "Yes"))

Let’s break it down.

k_zone_plot <- ggplot(cabrera_sample, aes(x = px, y = pz)) +

k_zone_plot <- assigns whatever the code to its right produces to an object called k_zone_plot. ggplot(cabrera_sample, calls the ggplot() function on the cabrera_sample data frame, which is a random sample of 500 pitches thrown to Miguel Cabrera in 2009. aes(x = px, y = pz)) + declares that the x-axis of the graphic will represent the column px and the y-axis will represent the column pz. px is the horizontal pitch location, while pz is the vertical pitch location. The trailing + tells R that more pieces of the plot follow on the next line.
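To see what that first line does on its own, here is a minimal sketch. It assumes ggplot2 is installed, and fake_pitches is a made-up stand-in for cabrera_sample, not the book's data.

```r
library(ggplot2)

# Hypothetical stand-in for cabrera_sample: 500 made-up pitch locations in feet.
set.seed(1)
fake_pitches <- data.frame(
  px = rnorm(500, mean = 0, sd = 0.8),   # horizontal location
  pz = rnorm(500, mean = 2.5, sd = 0.8)  # vertical location
)

# Store the plot specification under a name. Nothing is drawn at this point.
base_plot <- ggplot(fake_pitches, aes(x = px, y = pz))

class(base_plot)          # the class vector includes "ggplot"
base_plot$mapping         # lists which columns feed the x and y aesthetics
length(base_plot$layers)  # no geoms have been added yet
```

Typing base_plot by itself at the console would draw an empty panel with px and pz on the axes, since no points or shapes have been layered on.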

geom_rect(xmin = -0.947, xmax = 0.947, ymin = 1.5,
          ymax = 3.6, fill = "lightgray", alpha = 0.01) +

geom_rect() is a function from ggplot2 that draws a rectangle whose edges are set by the xmin, xmax, ymin, and ymax values. In this case, the values mark the edges of the strike zone. fill = "lightgray" means that the rectangle is filled with light gray. alpha = 0.01 sets the opacity of the rectangle on a scale from 0 (fully transparent) to 1 (fully opaque). Changing this value to 0.5 means you can actually see it.
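For experimenting with the rectangle by itself, one option is annotate("rect", ...), which draws a single shape that does not depend on the rows of the data. This is a swapped-in alternative to the book's geom_rect() call, not the book's method, and fake_pitches is again made-up data.

```r
library(ggplot2)

# Hypothetical pitch locations, used only so the plot has something to show.
set.seed(1)
fake_pitches <- data.frame(px = rnorm(500, 0, 0.8), pz = rnorm(500, 2.5, 0.8))

zone_plot <- ggplot(fake_pitches, aes(x = px, y = pz)) +
  # One half-transparent rectangle with the same edges as in the article.
  annotate("rect", xmin = -0.947, xmax = 0.947, ymin = 1.5, ymax = 3.6,
           fill = "lightgray", alpha = 0.5) +
  geom_point()

zone_plot  # draws the plot when run at the console
```

Raising or lowering alpha here is a quick way to get a feel for what the argument does.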

coord_equal() +

After reading the R documentation for this function 14 times, I think it means that one unit on the x-axis will be the same length as one unit on the y-axis. This is useful when the values on the x- and y-axes are measured in the same unit, as they are here (feet).
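A small comparison makes the effect easier to spot. This is a sketch with four made-up points at the corners of a 4 ft by 5 ft region; corners, stretchy, and locked are illustrative names.

```r
library(ggplot2)

# Made-up points at the corners of a region 4 units wide and 5 units tall.
corners <- data.frame(px = c(-2, 2, 2, -2), pz = c(0, 0, 5, 5))

# Without coord_equal(), the panel stretches to fill whatever window it is in.
stretchy <- ggplot(corners, aes(x = px, y = pz)) + geom_point()

# With coord_equal(), one unit of px takes up as much room as one unit of pz,
# so the panel stays taller than it is wide however the window is resized.
locked <- stretchy + coord_equal()

stretchy
locked
```

coord_equal() is another name for coord_fixed() with its default ratio of 1.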

scale_x_continuous("Horizontal location (ft.)",
                   limits = c(-2, 2)) +
scale_y_continuous("Vertical location (ft.)",
                   limits = c(0, 5))

scale_x_continuous() and scale_y_continuous() are functions that let you rename the axis titles. Through the limits argument, they also set the bounds of the graphic to include all the places where a pitch could be thrown and a batter would swing at it.
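One thing worth knowing about limits is that data outside them is not plotted. The sketch below uses three made-up pitches, two of which fall outside the bounds; toy and built are illustrative names.

```r
library(ggplot2)

# Three made-up pitches: the first is too far left, the last is too high.
toy <- data.frame(px = c(-3, 0, 1), pz = c(2, 2.5, 6))

p <- ggplot(toy, aes(x = px, y = pz)) +
  geom_point() +
  scale_x_continuous("Horizontal location (ft.)", limits = c(-2, 2)) +
  scale_y_continuous("Vertical location (ft.)", limits = c(0, 5))

# layer_data() returns the values ggplot2 will actually use for the points.
built <- layer_data(p, 1)
built[, c("x", "y")]  # no x below -2 and no y above 5 survives
```

Drawing p should also produce a warning about removed rows, which is ggplot2 reporting the out-of-bounds pitches.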

All of the above code just creates an object called k_zone_plot in the global environment. To actually create the graphic, the following code needs to be entered in a separate command.

k_zone_plot +
  geom_point(aes(color = factor(swung))) +

k_zone_plot simply calls the object. geom_point() adds a layer of points, which makes this chart a scatter plot. aes(color = factor(swung)) means that each point’s color depends on whether Cabrera swung. In the cabrera data frame from which these data originate, there’s a column called swung that holds either a 0 or a 1. 0 means he didn’t swing; 1 means he swung. Wrapping it in factor() tells ggplot2 to treat those numbers as two categories rather than as points on a continuous scale.
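What factor() does to a column of 0s and 1s can be checked in plain R with no plotting at all. The swung vector here is a made-up example, not the book's data.

```r
# Five made-up pitches: 0 = no swing, 1 = swing.
swung <- c(0, 1, 1, 0, 1)
class(swung)     # "numeric"

# factor() turns the numbers into categories with a fixed set of levels.
swung_f <- factor(swung)
class(swung_f)   # "factor"
levels(swung_f)  # "0" "1"

# Count how many pitches fall into each category.
table(swung_f)
```

The order of the levels matters later, because the colors and labels in the key are matched to the levels in that order.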

scale_color_manual("Swung", values = c("gray70", crcblue),
                   labels = c("No", "Yes"))

The scale_color_manual() function allows one to create their own color scale for a categorical variable by assigning a color to each category. "Swung" names the key. values = c("gray70", crcblue) assigns the colors in the order of the factor levels, so 0 (didn’t swing) is gray and 1 (swung) is blue. crcblue has no quotes because it is not a color name; it is an object holding a color, so it has to exist already when this code runs. labels = c("No", "Yes") names the values in the key, again in the order of the levels.
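To check which color lands on which group, a sketch like the following can help. It assumes ggplot2 is installed; fake is made-up data, and swing_blue is a hypothetical stand-in for crcblue with an arbitrary blue chosen for illustration. Naming the entries of values ties each color to a level explicitly rather than relying on order.

```r
library(ggplot2)

# Made-up pitches that alternate between no swing (0) and swing (1).
set.seed(1)
fake <- data.frame(
  px = rnorm(50, 0, 0.8),
  pz = rnorm(50, 2.5, 0.8),
  swung = rep(c(0, 1), times = 25)
)

swing_blue <- "#2905a1"  # hypothetical stand-in for crcblue

p <- ggplot(fake, aes(x = px, y = pz)) +
  geom_point(aes(color = factor(swung))) +
  scale_color_manual("Swung",
                     values = c("0" = "gray70", "1" = swing_blue),
                     breaks = c("0", "1"),
                     labels = c("No", "Yes"))

# Inspect the color assigned to each point.
point_colors <- layer_data(p, 1)$colour
table(fake$swung, point_colors)
```

Swapping the two entries in values and rerunning the table shows the colors trading places.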

Entering that gives us our nice, shiny swing map. Now, I wish I could say that I understand everything from Chapter 7, but there’s all this Loess smoothing that goes over my head. I still don’t fully get how the %+% operator works. I even looked at my solution for the last exercise, and it looks like it’s written in hieroglyphs. I have no recollection of writing this, but I guess I must have. Hopefully next week feels more like progress and less like Memento.


Kenny Kelly is the managing editor of Beyond the Box Score. You can follow him on Twitter @KennyKellyWords.