Learning R: A tidy, well-lighted place

The foundation is set. It’s time to build something.

Photo: Jerome Miron-USA TODAY Sports

When I first began the Learning R project, I didn’t know the full scope of what R could do. I knew that Serious Statisticians used R, so if I wanted to be a Serious Statistician, I would have to learn it. I approached it in much the same way as I did coming out of high school thinking I wanted to be a Serious Writer. Serious Writers read and wrote like Hemingway, I thought erroneously, so I set aside Douglas Adams and dove into those hills like white elephants, those snows of Kilimanjaro, those suns that also rise without really ever understanding why I was doing it.

If you approach learning that way, there’s going to be friction, and friction leads to chafing. In my brief-but-damaging “Write like Hemingway” phase, I was taking away the wrong lessons. The point of reading Hemingway isn’t to learn that to write, you must sit at the Google doc and bleed, and it sure as hell isn’t “Write drunk; edit sober.”

The lessons of Hemingway are ultimately “That thing you’re writing is entirely too long,” and “Verbs > Adjectives.” It took me a long time to realize that writing like Hemingway doesn’t mean seeing the world the way he did. I was writing, but I wasn’t creating.

There’s been less chafing in my foray into R. Progress is more visible in STEM than in writing. Sure, there’s been some annoyance at the way things are written (looking at you, R documentation), and some technical issues that were only occasionally my fault. Along the way, however, I’ve added tools to my shed. I can mutate(), I can inner_join(), I can build run expectancy matrices, and I can evaluate ball and strike calls by umpire. It’s easy to look back and see how far I’ve come.

As in my Hemingway phase, I may be coding, but I’m not creating. I have replicated a lot of the projects from Analyzing Baseball Data with R, but aside from looking at bases-loaded, no-outs situations and Barry Bonds in 0-2 counts, I haven’t done many projects of my own. After reading through Chapter 9, which covers simulating a season, I think I’m ready to venture forth with a simulation of my own.

Shortly after the 2020 season began, MLB officially announced it would be expanding the playoffs to 16 teams, eight from each league. The two teams with the best records in each division will earn a spot. In addition, the two remaining teams in each league with the best records will enter as wild cards. Had this system been in place last year, the Diamondbacks, Mets, and Cubs would have earned playoff spots in the National League. The worst record among them was the Cubs at 84-78.

In the American League, Cleveland, Boston, and Texas all would have made it, too. The Rangers finished at 78-84, and they would have had a shot at winning the World Series. How much of a shot? Well, that’s what I aim to figure out.
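The qualification rule above is simple enough to sketch in base R before any simulating happens. What follows is a rough illustration, not the simulation itself: `playoff_field()` is a made-up helper name, the 2019 American League win totals are typed in by hand and should be checked against an official source, and ties are broken by row order rather than by MLB's actual tiebreaker rules.

```r
# 2019 American League win totals, entered by hand for illustration only.
al <- data.frame(
  team = c("Yankees", "Rays", "Red Sox", "Blue Jays", "Orioles",
           "Twins", "Cleveland", "White Sox", "Royals", "Tigers",
           "Astros", "Athletics", "Rangers", "Angels", "Mariners"),
  division = rep(c("East", "Central", "West"), each = 5),
  wins = c(103, 96, 84, 67, 54,
           101, 93, 72, 59, 47,
           107, 97, 78, 72, 68),
  stringsAsFactors = FALSE
)

# Hypothetical helper: pick the eight-team field for one league.
# Assumption: ties are broken by row order, not by real tiebreakers.
playoff_field <- function(standings) {
  # Rank teams within their own division (1 = most wins).
  standings$div_rank <- ave(-standings$wins, standings$division,
                            FUN = function(x) rank(x, ties.method = "first"))

  # The top two teams in each division qualify automatically.
  auto <- standings[standings$div_rank <= 2, ]

  # The two best records among everyone else take the wild cards.
  rest <- standings[standings$div_rank > 2, ]
  wild <- rest[order(-rest$wins), ][1:2, ]

  # Return the full field, best record first.
  field <- rbind(auto, wild)
  field[order(-field$wins), c("team", "division", "wins")]
}

playoff_field(al)
```

With those win totals, the last two teams into the field are the Red Sox and the Rangers, which matches the list above.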

For next week, I’m going to build a simulation for this wonky playoff structure and see how likely it is that a sub-.500 team comes away with a piece of metal at the end of this season. Come back next Sunday. I promise there will be less rambling about Hemingway.


Kenny Kelly is the managing editor of Beyond the Box Score. You can follow him on Twitter @KennyKellyWords.