When we left off a few weeks ago, the projection systems had displayed mixed abilities at projecting hitters in 2016. Steamer came out on top, followed by PECOTA and ZiPS, and the Marcel the Monkey forecasting system – designed to be the simple, baseline system that other projection systems have to beat – did remarkably well, keeping up with, and in places beating, all three of its more complicated peers.
The obvious follow-up to that article is this one, on pitchers. It comes after a three-week gap because 2016 Marcels for pitchers, unfortunately, weren’t publicly available anywhere that I could find. But Marcel – “so simple a monkey can do it” – is also so simple that I can do it, salvaging this article.
Many, many thanks to Tom Tango for coming up with the idea behind the system, making the methodology public, and for helping with some of my questions.
If you’re curious about any of the quirks of the various systems, I would suggest checking out the article from last week, and this article from last year, which went into the different approaches of each system. There are real, substantive differences between all of them, and if one or more systems show themselves to be consistently more accurate than the others, figuring out why is the necessary next step.
My methodology was very similar. I identified the players who were projected by all three systems (Marcel projects everyone), threw some number of innings in 2016, and were projected in the same role (starter or reliever) that they actually filled. That last filter isn’t strictly required, but two systems might agree completely on how good a pitcher is and still differ on his stats based only on whether they think he will start or relieve. The ability of the different systems to predict how a player will be used isn’t something I’m interested in testing, and this seemed like the cleanest way of making it irrelevant. It also eliminated a relatively low number of players, so it shouldn’t fundamentally affect the results in any case. The pool consists of 474 pitchers with 33,032 2⁄3 innings collectively, 76 percent of the total innings pitched in 2016.
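A minimal sketch of that filter, for readers who want to rebuild the pool. The record layout and names (`Pitcher`, `in_pool`, `innings_share`) are hypothetical, the sample pitchers are made up, and it assumes the role check applies to each of the three non-Marcel systems.

```python
from dataclasses import dataclass

# Hypothetical record layout; field names are illustrative only.
@dataclass
class Pitcher:
    name: str
    innings: float         # actual 2016 innings pitched
    actual_role: str       # "SP" or "RP"
    projected_roles: dict  # system -> projected role, only for systems that projected him

# Marcel is left out because it projects everyone.
REQUIRED_SYSTEMS = ("Steamer", "PECOTA", "ZiPS")

def in_pool(p: Pitcher) -> bool:
    """Keep a pitcher only if he pitched and every required system
    projected him in the role he actually filled."""
    if p.innings <= 0:
        return False
    # .get returns None for a missing system, which never equals a real role.
    return all(p.projected_roles.get(s) == p.actual_role for s in REQUIRED_SYSTEMS)

def innings_share(pool, league_innings: float) -> float:
    """Fraction of league innings covered by the pool."""
    return sum(p.innings for p in pool) / league_innings

# Made-up sample data, not real 2016 pitchers.
all_sp = {"Steamer": "SP", "PECOTA": "SP", "ZiPS": "SP"}
pitchers = [
    Pitcher("A", 200.0, "SP", all_sp),
    Pitcher("B", 60.0, "RP", {"Steamer": "RP", "PECOTA": "RP", "ZiPS": "SP"}),
    Pitcher("C", 0.0, "SP", all_sp),
    Pitcher("D", 140.0, "SP", {"Steamer": "SP", "ZiPS": "SP"}),
]
pool = [p for p in pitchers if in_pool(p)]
```

Here B is dropped for a role mismatch, C for throwing no innings, and D for lacking a PECOTA projection.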
I looked at each player’s projection in four categories: ERA, K/9, BB/9, and HR/9, which seem to cover the important bases of pitching ability and are covered by every system. I scaled that projection to league average for each projection system (because each system’s guess at what league average will be in a given year is also something I don’t think is worth testing) and calculated the difference between projection and actual performance, and between the average of all four projections and actual performance. I took the mean of those differences across players, weighted by innings pitched, to get an average error for each system. (Like last time, thanks to this old Nate Silver post and this Tom Tango thread for the basic methodology.)
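The calculation can be sketched roughly as follows. This is one plausible reading rather than the exact procedure: it assumes the "difference" is an absolute error, that scaling means multiplying each projection by the ratio of the actual league average to the system's own projected league average, and that both averages are innings-weighted over the pool. All function names are hypothetical.

```python
def weighted_mean(values, weights):
    """Mean of values weighted by weights (here, innings pitched)."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def scale_to_league(projected, system_league_avg, actual_league_avg):
    """Re-center one projection so the system's league average matches reality.
    Assumption: a simple ratio adjustment."""
    return projected / system_league_avg * actual_league_avg

def weighted_abs_error(projected, actual, innings):
    """Innings-weighted mean absolute error for one system in one category.
    projected, actual, innings are parallel lists, one entry per pitcher."""
    system_avg = weighted_mean(projected, innings)
    actual_avg = weighted_mean(actual, innings)
    errors = [
        abs(scale_to_league(p, system_avg, actual_avg) - a)
        for p, a in zip(projected, actual)
    ]
    return weighted_mean(errors, innings)
```

Under this reading, a system that ranks and spaces pitchers correctly but guesses the run environment 10 percent too low is not penalized, which matches the stated goal of not testing the league-average guess.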
First up, the results for the entire collection of pitchers.
Steamer leads the pack, as it did for the 2016 hitters. It is the best system in three of the four categories, and not by a small margin in any of them (particularly HR/9). For the hitters, PECOTA was consistently in second place, though rarely by much of a margin; here, while it takes first place in the one category that Steamer doesn’t, there’s generally more of a gap between PECOTA and Steamer.
ZiPS struggled with the pitchers, as it also did with the 2016 hitters. ZiPS was the least accurate in all four categories, often by large margins. Again, I’m hesitant to read too much into a single season of results, but as this ordering starts to replicate itself, for both pitchers and hitters and over multiple years, the conclusion that ZiPS is meaningfully behind the other systems (and that Steamer is ahead) looks more and more reasonable.
Marcel did its job, again. It doesn’t lead any categories, but neither does it bring up the rear in any. It remains difficult to beat consistently or convincingly, despite its incredible simplicity.
The simple average was very accurate. The average of the four systems was never the most accurate, but it also was never off the pace by a large margin. In many ways, it was more consistent than any of the individual systems, as it was never worse than the second-most accurate. From a methodological perspective, there’s no reason to think that a simple average of four distinct systems would yield good results, but as with the hitters, it seems to work well.
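For concreteness, the "simple average" here is just an unweighted per-player mean across systems. A minimal sketch, with a hypothetical `consensus` function and made-up numbers:

```python
from statistics import mean

def consensus(projections_by_system):
    """Unweighted per-player average across systems.
    projections_by_system: system name -> {player: projected value}.
    Only players projected by every system are averaged."""
    per_system = list(projections_by_system.values())
    shared = set(per_system[0]).intersection(*per_system[1:])
    return {player: mean(s[player] for s in per_system) for player in shared}

# Made-up ERA projections, for illustration only.
example = {
    "Steamer": {"A": 3.0, "B": 4.0},
    "PECOTA": {"A": 3.4, "B": 4.2},
    "ZiPS": {"A": 3.2, "B": 4.4},
    "Marcel": {"A": 3.6, "B": 4.6, "C": 4.1},
}
avg = consensus(example)
```

Player C is skipped because only one system projects him; the averaged values can then be scored against actual results exactly like any single system's projections.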
What “accurate” is in each of the four categories is itself somewhat interesting, though unsurprising. Home runs are the most difficult to predict by a substantial margin, followed by runs allowed, then walks, and then strikeouts. That matches our beliefs about the consistency and predictability of each of those statistics, but it’s nice to see that reflected in these results.
Next, I sliced the set of pitchers into several subgroups, as I did with the hitters, to try to see if any of the systems have a notable “specialty” in some distinct category. I began with rookies, the players that Marcel “projects” to be exactly league average and, presumably, the hardest players to project for every system, given their lack of track record. This is also a relatively small subset of the whole group – 42 pitchers covering 1,952 innings – which should also push the error bars up.
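To see why a system like Marcel lands on exactly league average for a rookie, here is a Marcel-style sketch: a weighted average of recent seasons, regressed toward the league rate. The 3/2/1 weights and the 100-inning regression amount are illustrative assumptions, not Tango's published values, and the age adjustment is left out entirely.

```python
def marcel_style_rate(history, league_rate, weights=(3, 2, 1), regression_ip=100.0):
    """Project a per-nine rate from past seasons, regressed toward league average.

    history: list of (rate, innings) tuples, most recent season first.
    weights and regression_ip are illustrative assumptions.
    """
    weighted_ip = 0.0
    weighted_sum = 0.0
    for (rate, innings), w in zip(history, weights):
        weighted_ip += w * innings
        weighted_sum += w * innings * rate
    # The regression term acts like extra innings of league-average pitching.
    return (weighted_sum + regression_ip * league_rate) / (weighted_ip + regression_ip)

rookie = marcel_style_rate([], league_rate=4.2)
veteran = marcel_style_rate([(3.0, 200.0)], league_rate=4.0)
```

With no history, only the regression term remains, so the rookie comes out at the league rate; the veteran's projection falls between his own 3.0 and the league's 4.0.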
As expected, the errors do go up across the board, and Marcel falls off the pace somewhat. It still manages to be not-terrible, however, which is shocking, given how little it does to project these players. For the hitters, ZiPS performed relatively well on the rookies, but it doesn’t do so here, staying in last for two categories and beating only Marcel for a third. It was the best system at projecting home runs, the category with the highest average error, but was still beaten out by the average of all four.
Particularly notable is PECOTA’s dominance at projecting rookie ERA, and to a lesser extent, BB/9. It beat the average convincingly, and every other system by a huge margin, when it came to the single most important category – how many runs these pitchers allowed – which should be a big plus in its favor. Again, this is one year of results, and a very small subset of the whole group, but it’s certainly worth noting.
The next subgroups I looked at were starters and relievers, with 172 pitchers/21,462 2⁄3 innings and 302 pitchers/11,570 innings respectively. Here are the summaries for each:
The systems are all more accurate on the starters, which isn’t a surprise. Marcel really drops the ball on the relievers, which is interesting but also not particularly surprising; given relievers’ lack of track record, the systems that look to a pitcher’s fundamentals instead of just his performance over the previous few years probably should be more accurate.
Beyond that, however, there’s not much of a pattern, which in and of itself is a bit surprising. None of the systems really distinguished themselves when it comes to either group; the most striking trend is the degree to which ZiPS struggled with starters.
Finally, to try to find some interesting pattern, I looked at how the systems did at projecting the very best pitchers, defined as any pitcher in the top 20 starters by ERA for any of the systems. Twenty-seven pitchers fit that definition, and they threw 4,467 2⁄3 innings in 2016. As with the hitters, this is an important group for a system to project accurately, as they’re the players who headline massive trades and can make (or break) a team’s season.
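That definition is a union of per-system top-20 lists, which is why it yields more than 20 names. A minimal sketch, with a hypothetical `elite_pool` function and made-up projections; ties at the cutoff are broken arbitrarily here, which is an assumption.

```python
def elite_pool(era_by_system, n=20):
    """Union of each system's n lowest projected ERAs among starters.
    era_by_system: system name -> {pitcher: projected ERA}."""
    pool = set()
    for projections in era_by_system.values():
        ranked = sorted(projections, key=projections.get)  # lowest ERA first
        pool.update(ranked[:n])
    return pool

# Made-up projections with n=2 to keep the example small.
example = {
    "Steamer": {"A": 2.5, "B": 2.9, "C": 3.1},
    "ZiPS": {"A": 2.6, "B": 3.2, "C": 3.0},
}
elite = elite_pool(example, n=2)
```

The systems agree on A but split on the second spot, so the pool holds three pitchers rather than two.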
But alas, there’s still not much here. The average might have provided less utility, and Marcel does well in a couple categories; given these players’ long track records, that might make sense. But none of the systems did well across the board; Steamer came the closest, doing an excellent job with ERA and K/9, a fine job with HR/9, and only struggling with BB/9.
But the overall conclusion would seem to be that these systems don’t have very distinct specialties – they’re good/bad/mediocre in all subgroups. For 2016 at least, it was Steamer that was good, PECOTA that was mediocre, and ZiPS that was bad. Again, this is not a particularly robust set of results, based as it is on a single year, but it’s tempting to draw some conclusions about the methodological differences between the systems. ZiPS is the one that most wholeheartedly embraces DIPS, and it may be that DIPS is an outdated or inaccurate conception of pitching. PECOTA is based on player comps, while Steamer most resembles a more complex version of Marcel, incorporating more variables and tweaks on the margins. Based on these results at least, it would seem that Steamer and Marcel’s approach is not only the basic one, but the most accurate one.
The final takeaway: Marcel is beatable, but doing so consistently remains very difficult. It gets us across the easy 90 percent of projecting performance; the other systems try to cross that final 10 percent, and that’s the hard part.
Henry Druschel is the Managing Editor of Beyond the Box Score. You can follow him on Twitter @henrydruschel.