I’m not quite sure where to begin with this week’s entry of Learning R. Previously, my inspiration drew from the intense frustration of getting error message after error message or having Chadwick bork on me. That’s why I spent a quarter of last week rambling about the exercises feeling like late-season technical challenges on The Great British Bake-Off and half of the week before talking about a superstition inspired by Taylor Teagarden. This week, however, was relatively pain-free. It didn’t remind me of any major trials or tribulations; it was just kind of relaxing.
Chapter 4 of Analyzing Baseball Data with R doesn’t introduce a lot of new functions. Instead, it shows how you can synthesize what you’ve learned so far to do some actual #analysis. By now, loading data frames, filtering out the unwanted information, and mutating new metrics should be familiar. The major new tool added to the box is lm(), which fits a linear model and is useful for measuring how one metric moves with another.
The hardest part is just understanding the math behind it. Chapter 4 mostly deals with Pythagorean record. Before a recent FanGraphs prep post, I didn’t know how Pythagorean winning percentage was calculated, only that it was based on run differential. It’s a simple formula, R² / (R² + RA²), where R is runs scored and RA is runs allowed, but keeping the exponent fixed at 2 across all eras of baseball might be a bit reductive.
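For anyone who wants to see the formula as code, here is a minimal sketch in base R. The team names and season totals are made up for illustration, and the column names (Wpct, Wpct_pyt, resid) are my own choices rather than anything the chapter requires. The book builds similar columns on real season data with mutate(); I’ve stuck to base R so the snippet runs without any packages.

```r
# Hypothetical season totals; every number here is made up for illustration.
teams <- data.frame(
  team = c("AAA", "BBB", "CCC"),
  W    = c(95, 81, 67),
  L    = c(67, 81, 95),
  R    = c(800, 700, 650),   # runs scored
  RA   = c(650, 700, 800)    # runs allowed
)

# Actual winning percentage.
teams$Wpct <- teams$W / (teams$W + teams$L)

# Pythagorean estimate with the exponent fixed at 2.
teams$Wpct_pyt <- teams$R^2 / (teams$R^2 + teams$RA^2)

# Residual: positive means the team won more often than its runs suggest.
teams$resid <- teams$Wpct - teams$Wpct_pyt

print(teams)
```

A team that scores exactly as many runs as it allows comes out at .500, which is a handy sanity check on the formula.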
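That exponent is going to vary depending on the offensive environment. For instance, outscoring your opponents by 10 runs in the ’60s was worth about 0.2 wins more than it was in the ’70s. To find a more accurate exponent, you rewrite the formula as W/L = (R/RA)^k and take the logarithm of both sides, which gives log(W/L) = k × log(R/RA). The exponent k is now the slope of a line, which is exactly what lm() can estimate.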
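Here is a rough sketch of that logarithm trick, again in base R. It leans on the fact that the Pythagorean formula with a general exponent k can be rewritten as W/L = (R/RA)^k, so log(W/L) = k × log(R/RA). To keep things checkable, the run totals below are invented and the win/loss ratios are generated from a known exponent (true_k, a hypothetical value I picked); with real standings you would use actual wins and losses instead.

```r
# Hypothetical run totals (made up). The win/loss ratios are generated from a
# known exponent so we can check that the regression recovers it.
true_k       <- 1.9
runs         <- c(820, 760, 700, 690, 640, 600)
runs_allowed <- c(650, 700, 700, 720, 730, 790)
wl_ratio     <- (runs / runs_allowed)^true_k  # stands in for W / L

# Taking logs of W/L = (R/RA)^k gives log(W/L) = k * log(R/RA):
# a straight line through the origin whose slope is the exponent.
log_wl   <- log(wl_ratio)
log_runs <- log(runs / runs_allowed)

# "0 +" drops the intercept, so the only coefficient is the slope k.
fit   <- lm(log_wl ~ 0 + log_runs)
k_hat <- unname(coef(fit)["log_runs"])

print(k_hat)
```

Because this toy data has no noise, the fitted slope matches true_k up to floating-point error. Real seasons are messier, so the estimate will depend on which years you feed in.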
As someone who majored in English, I haven’t had to think about logarithms in about 14 years. While it wasn’t necessary to know all the math to replicate the code or do the exercises, it wouldn’t have done me any good to copy things without grasping what I was doing. To get through the chapter with some vague understanding of what I was actually calculating, I had to turn to a CliffsNotes page for Algebra II students.
Though the math portion of Chapter 4 reminded me that my arithmetic abilities have regressed to my ninth-grade levels, the coding actually felt good. This was the first time I felt like I got to write code of my own. I used R to compare the cumulative WAR of Buster Posey, Yadier Molina, Russell Martin, and Brian McCann recently, but even then I was copying code from the book and just working from a different CSV.
Everything up until now has felt like following a recipe, but now I’m able to come up with my own dishes. They might just all be different variations of stir fry and pasta, but they’re my stir fry and pasta.
Kenny Kelly is the managing editor of Beyond the Box Score. You can follow him on Twitter @KennyKellyWords.