I had a modest goal entering the fourth week of the Learning R project: to put a data label on a scatter plot. Reader, I have good news. Behold!
That, my friends, is a scatterplot of every Hall of Fame pitcher’s strikeout and walk totals with a label to let you know that Nolan Ryan was an animal. It even has a title to let you know what you’re looking at.
Now, it took me roughly three hours of my Saturday afternoon (plus all the time spent in the previous weeks) to do something that I could have done in Excel in about four minutes, but this is a building block to make sure I never have to use Excel ever again. I don’t need you, Excel.1
I do, however, still need to copy code from Analyzing Baseball Data with R. I may have put a data label on a scatter plot, but I don’t necessarily know how I did it. So, I’m going to break down the code to see if it makes more sense. This is the code that I used to build the plot and add the label and title:
ggplot(hofpitching, aes(x = BB, y = SO)) +
  geom_point() +
  ggtitle("HOF Strikeouts and Walks") +
  geom_text_repel(data = filter(hofpitching, BB > 2000), aes(BB, SO, label = X2))
Going line by line, here’s what I think I’m doing. What follows is speculative and possibly wrong.
ggplot(hofpitching, aes(x = BB, y = SO)) +
ggplot is telling R to construct a graph of indeterminate type. The arguments that follow will define what goes into the graph. hofpitching is the data frame that R will be calling from. I need to separate that with a comma or else R won’t know that the data frame’s name has ended. aes declares the aesthetics of the graph based on the arguments given. (x = BB, y = SO) is pretty straightforward. That’s assigning the x and y axes.
geom_point() +
This is a function to tell R which type of graph I want. A scatter plot makes the most sense since I want to look at the relationship between strikeouts and walks.
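One way to check that reading is to build the plot in two steps and count its layers. This is only a sketch: toy is a made-up three-row data frame standing in for hofpitching, and the numbers in it are invented, not real career totals.

```r
library(ggplot2)

# Hypothetical stand-in for hofpitching; the values are made up for illustration.
toy <- data.frame(
  X2 = c("Pitcher A", "Pitcher B", "Pitcher C"),
  BB = c(900, 1500, 2500),
  SO = c(2000, 3000, 5000)
)

# ggplot() alone sets up the data and the axes but draws no points.
base <- ggplot(toy, aes(x = BB, y = SO))
length(base$layers)

# Adding geom_point() attaches one layer, which is the scatter plot itself.
scatter <- base + geom_point()
length(scatter$layers)
```

If the layer count goes up by one after geom_point(), that suggests ggplot() is the blank canvas and each geom_ function is something drawn on top of it.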
ggtitle("HOF Strikeouts and Walks") +
ggtitle is a function to declare a title. I need to have the package ggplot2: the sequel installed to have it work. Otherwise, R is going to tell me “There’s no ggtitle function, dude.” “HOF Strikeouts and Walks” is what I want the title of the scatterplot to be.
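For what it's worth, installing a package and loading it are separate steps: install.packages() is a one-time thing, while library() has to be run in every new R session before ggtitle can be found. Here is a small sketch of the title step on its own, again using an invented data frame (toy) rather than the real hofpitching data.

```r
library(ggplot2)  # ggtitle() comes from ggplot2, so it must be loaded first

# Hypothetical stand-in for hofpitching; the values are made up for illustration.
toy <- data.frame(BB = c(900, 1500, 2500), SO = c(2000, 3000, 5000))

titled <- ggplot(toy, aes(x = BB, y = SO)) +
  geom_point() +
  ggtitle("HOF Strikeouts and Walks")

# The title is stored on the plot object as one of its labels.
titled$labels$title
```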
geom_text_repel(data = filter(hofpitching, BB > 2000), aes(BB, SO, label = X2))
geom_text_repel is a function from the ggrepel package. This is what lets me add labels. data does, uh, well. I don’t know what it does, but it does something important. If I take it out, the code breaks, and I get an error message saying that ‘mapping must be created by aes().’ I don’t know what that means! Without code from the book, I would not have known that was needed, and as far as I know that’s the first time I’ve seen it. I tried to look up ‘data’ in the index, but as you might surmise, a book about data science is going to have a lot of references to data, so an index entry would be pointless.
The only information I found was in R itself. RStudio has a feature, underutilized by me, called ‘Help,’ which gives more information on functions and whatnot. For data in the geom_text_repel function, it says, “A data frame. If specified, overrides the default data frame defined at the top level of the plot.” So, if I’m reading that correctly, it means that it’s enacting its function upon the data frame included in the filter() argument.
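A possible explanation for that error, assuming the only thing removed was the data = name and not the filter() call after it: geom_text_repel takes mapping as its first argument and data as its second, so an unnamed data frame in first position gets treated as the mapping. The sketch below checks the argument order and compares the named and unnamed calls. toy is a made-up data frame, not the real hofpitching data.

```r
library(ggplot2)
library(ggrepel)
library(dplyr)

# The first two arguments of geom_text_repel(), in order.
names(formals(geom_text_repel))[1:2]

# Hypothetical stand-in for hofpitching; the values are made up for illustration.
toy <- data.frame(
  X2 = c("Pitcher A", "Pitcher B", "Pitcher C"),
  BB = c(900, 1500, 2500),
  SO = c(2000, 3000, 5000)
)

# Named: the filtered data frame goes to `data`, and aes() fills `mapping`.
named <- geom_text_repel(data = filter(toy, BB > 2000), aes(label = X2))

# Unnamed: the filtered data frame is matched by position to `mapping`,
# which has to be an aes() object, so R stops with an error.
unnamed <- tryCatch(
  geom_text_repel(filter(toy, BB > 2000), aes(label = X2)),
  error = function(e) conditionMessage(e)
)
unnamed
```

If that is what happened, data = is not doing anything magical. It is just a name tag that tells R which slot the filtered data frame belongs in.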
Anyway, filter is filtering data based on the arguments that I give it. hofpitching is the data frame R is calling from and BB > 2000 is the criterion that needs to be met to get a shiny data label. aes is again altering how the graph looks. I don’t know why I need to call BB, SO again. Actually, now that I check, I don’t need them. I took them out and the code still worked. label is telling R to add a label to every data point that matches the filter criterion. X2 is the vector where player names are kept. The CSV that was read into R didn’t have a title for that column, so that was the default.
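The likely reason BB and SO can be dropped is that a layer inherits the x and y mappings from the ggplot() call unless told otherwise, so the label layer only has to supply label. The sketch below tries that out on an invented data frame (toy, with made-up numbers) and uses layer_data() to peek at what the label layer actually received.

```r
library(ggplot2)
library(ggrepel)
library(dplyr)

# Hypothetical stand-in for hofpitching; the values are made up for illustration.
toy <- data.frame(
  X2 = c("Pitcher A", "Pitcher B", "Pitcher C"),
  BB = c(900, 1500, 2500),
  SO = c(2000, 3000, 5000)
)

# filter() keeps only the rows that pass the test.
big_walkers <- filter(toy, BB > 2000)
nrow(big_walkers)

# The label layer names only `label`; x and y come from the ggplot() call.
p <- ggplot(toy, aes(x = BB, y = SO)) +
  geom_point() +
  geom_text_repel(data = big_walkers, aes(label = X2))

# layer_data() returns the data the second layer (the labels) ended up with.
label_layer <- layer_data(p, 2)
label_layer[, c("x", "y", "label")]
```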
That makes a bit more sense after I type it all out. Of course, I still don’t know for sure why data needs to be included, only that the code won’t work if I don’t use it. I’m chalking this up as a victory though. I successfully added a data label to a scatter plot, and it only took me a month to do it.2
1) I’ll probably still need Excel.
2) Okay, really like four days because I’m lazy, and I only work on this on Saturdays.
Kenny Kelly is the managing editor of Beyond the Box Score. You can follow him on Twitter @KennyKellyWords.