A million monkeys at a million spreadsheets: 2015's projection systems in review, part two

Comparing the performance of the major projection systems in 2015, concluding with the pitchers. Predicting baseball: still hard, especially for pitchers.

Andrew H. Walker/Getty Images

When we left off last week, things were looking bleak for the major projection systems, at least when it came to projecting hitters in 2015. Only Steamer was consistently more accurate than Marcel (an extremely basic projection system and the baseline every other projection system should aim to beat), and for virtually every stat and subset of players, Marcel wasn't substantially worse than the mainstream systems. Even among rookies, whom Marcel only "projects" to be league average, PECOTA, Steamer, and ZiPS were unable to convincingly beat it.

This week, I'm repeating the process, but looking at projections for pitchers rather than hitters. Everything I said in the previous article applies to this one, so if you haven't read that, now might be a good time. As a quick review, the projection systems I'm comparing are PECOTA, Steamer, ZiPS, and Marcel. The sample is composed of all pitchers with more than 0 innings pitched (IP) who had a projection from all three of PECOTA, Steamer, and ZiPS. That gives me 549 players and 38,899.1 IP. Innings pitched isn't as good a measure of playing time as PA is for batters – I had to remove poor Nick Greenwood, formerly of the Cardinals, since he faced two hitters without recording an out – but it's what all the projection systems have, so it's what I'm using.

With this sample I scaled the projected and actual stats to league average. This allowed me to focus on the projections' relative accuracy (a decision I discuss in more depth in the previous article) and calculate the mean absolute error of each projection system, weighted by each player's IP. The statistics I chose to focus on were K/9, BB/9, HR/9, and ERA. Again, this is somewhat limited by the stats that are included in all four of the projection systems, but it also covers a wide range of skills. I'm a fan of FIP, and that's reflected in my choice of all three of the true outcomes, but including ERA should cover contact-management, as well as providing a measurement of overall value.
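For anyone who wants to try this at home, here is a minimal sketch of that calculation. It is not the code behind the tables in this article, and it rests on some assumptions: each set of rates is scaled by dividing by its own IP-weighted average, actual IP is used as the weight throughout, and IP is stored as a true decimal (so 38,899.1 in baseball notation would be 38,899.333). The function names and the three sample pitchers are made up for illustration.

```python
def weighted_mean(values, weights):
    """Average of values, with each value counted in proportion to its weight."""
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight


def scale_to_league(rates, ip):
    """Express each rate relative to the IP-weighted league average (1.0 = average)."""
    league_rate = weighted_mean(rates, ip)
    return [rate / league_rate for rate in rates]


def weighted_mae(projected, actual, ip):
    """Mean absolute error between projected and actual values, weighted by IP."""
    errors = [abs(p - a) for p, a in zip(projected, actual)]
    return weighted_mean(errors, ip)


# Hypothetical K/9 figures for three pitchers (not real data).
ip = [200.0, 60.0, 120.0]
actual_k9 = [9.0, 7.0, 8.0]
projected_k9 = [8.5, 8.0, 8.0]

actual_plus = scale_to_league(actual_k9, ip)
projected_plus = scale_to_league(projected_k9, ip)

k9_error = weighted_mae(projected_plus, actual_plus, ip)
print(round(k9_error, 3))
```

Scaling both sides means a system isn't punished for missing the league-wide run environment, only for misjudging where each pitcher sits relative to his peers.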

First, I considered the whole sample.

ALL PLAYERS   K/9+   BB/9+   HR/9+   ERA+
PECOTA        .152   .220    .297    .221
Steamer       .124   .207    .285    .195
ZiPS          .146   .223    .316    .218
Marcel        .144   .216    .300    .215

Lower numbers indicate lower average error, and therefore greater accuracy.

The first thing that leaps out from this table is Steamer's continued excellence when compared to the other systems. It leads each category by a substantial margin, as it did for the hitters, cementing its status as the most accurate system of 2015 for both pitchers and hitters. I didn't expect Steamer to repeat. If accurate projection of hitters and pitchers requires different attributes from a system, as seems likely, there's no reason to think a projection system that's good at the former will be good at the latter. It could just be coincidence, or it could be that there's something consistent about the way Steamer projects both pitchers and hitters that gives it an edge over the field.

The second thing I noticed is that, again, Marcel is right in the thick of things, as its projections were more accurate than at least one of PECOTA or ZiPS in every category. Interestingly, there's no category that's resistant to Marcel's extremely simple approach, or at least more resistant than others. I had expected Marcel to perform well in comparison to the more complex systems when predicting a category like HR/9, which fluctuates greatly from year to year and should generally be regressed heavily, but to struggle in K/9, where incorporating something like velocity data (which Marcel ignores) seems like it should provide a big advantage. Turns out I was wrong; Marcel is decent-to-good across every category.

Finally, it seems like the overall performance of a pitcher is, in general, harder to predict than that of a batter. The best pitcher K/9 accuracy and the best batter K% accuracy are very similar, as are the best BB/9 and BB%, but the best "wOBA" figure is more than twice as accurate as the best ERA figure. This might have to do with the general volatility of pitchers, or the degree to which their ERA depends on the defense behind them (while a batter's production is almost entirely dependent on his skill alone). In any case, it's interesting.

As with last week, my next step was to look at the accuracy of the average of the systems in each category, and observe the change in the accuracy of the average when a given system was left out. Averaging disagreeing systems is a common tactic, and this allows for an evaluation of that, as well as seeing which systems add the most unique information as compared to the rest of the field.

UNIQUENESS TESTING       K/9+    BB/9+   HR/9+   ERA+
Average                  .131    .206    .287    .203
Change without PECOTA    .002    .000    -.001   .002
Change without Steamer   -.006   -.002   -.007   -.006
Change without ZiPS      .000    .001    .002    .001
Change without Marcel    -.001   -.003   .000    -.002

Negative figures indicate added accuracy.
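The leave-one-out test can be sketched roughly as follows. This is an illustration rather than the code that produced the table. It assumes the consensus is a plain unweighted mean of the systems, and it takes the sign convention from the note above: the change is the full-consensus error minus the error with a system removed, so a negative figure means the consensus gets worse without that system. The system names "A", "B", and "C" and their numbers are hypothetical.

```python
def weighted_mae(projected, actual, ip):
    """Mean absolute error between projected and actual values, weighted by IP."""
    total_ip = sum(ip)
    return sum(abs(p - a) * w for p, a, w in zip(projected, actual, ip)) / total_ip


def consensus(projections, leave_out=None):
    """Simple per-player average across systems, optionally skipping one system."""
    systems = [name for name in projections if name != leave_out]
    n_players = len(projections[systems[0]])
    return [
        sum(projections[name][i] for name in systems) / len(systems)
        for i in range(n_players)
    ]


def uniqueness(projections, actual, ip):
    """Return the full-consensus error and the change attributed to each system."""
    full_error = weighted_mae(consensus(projections), actual, ip)
    changes = {}
    for name in projections:
        reduced_error = weighted_mae(consensus(projections, leave_out=name), actual, ip)
        # Negative: the consensus is less accurate once this system is removed.
        changes[name] = full_error - reduced_error
    return full_error, changes


# Hypothetical league-scaled projections for two pitchers from three systems.
projections = {
    "A": [1.0, 1.0],
    "B": [1.0, 1.0],
    "C": [1.3, 1.3],
}
actual = [1.0, 1.0]
ip = [100.0, 100.0]

full_error, changes = uniqueness(projections, actual, ip)
for name, change in changes.items():
    print(name, round(change, 3))
```

In this toy case, dropping "C" makes the consensus perfect, so its change is positive, while dropping either accurate system lets "C" drag the average further off and produces a negative change.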

The first thing to note is that, as with batters, the simple average of all the systems does a pretty good job. It actually has the best accuracy in the BB/9 category, and is not far off the leader in each of the other three. Averaging the systems is a very reasonable approach that seems to deliver reliable accuracy, particularly if you think Steamer won't continue its supremacy next year.

Second, Steamer's better performance in 2015 is really hammered home, as the large changes in the accuracy of the average at predicting K/9, HR/9, and ERA when Steamer is removed show that it was breaking from the consensus frequently, and providing unique insight as a result. Each of the other three systems had relatively small figures, indicating substantial overlap between their predictions and the predictions of the other systems.

As with the batters, the next thing I did was move on to the pitchers with no major-league experience who received projections of precisely league-average performance from Marcel.

ROOKIES   K/9+   BB/9+   HR/9+   ERA+
PECOTA    .170   .294    .342    .267
Steamer   .159   .301    .355    .269
ZiPS      .142   .333    .395    .271
Marcel    .206   .311    .336    .261

Lower numbers indicate lower average error, and therefore greater accuracy.

Like the analysis of rookie projections in part one, these results come from a much smaller sample of innings (2,516.2, about 6.5% of the total sample) and so are more likely to reflect randomness instead of the actual accuracy of the systems.

Even so, the results here are incredible. Marcel, with the lowest average error in ERA, was the most accurate system at projecting the overall performance of pitchers with no major-league track record. This is similar to the result in part one, when it was the most accurate in projecting the overall performance of rookie batters. I'm repeating myself by now, but as a reminder: Marcel barely projected these players at all, pegging them only for league-average performance. Each of the other systems departed from that, and tried to identify which new pitchers would be better or worse than average, and was in turn less accurate. Baseball: incredibly hard to predict, especially when it comes to rookie pitchers.
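A toy example shows how a flat league-average projection can win under mean absolute error. The numbers below are invented and the names are hypothetical. The "bold" system gets the direction right for both pitchers but overshoots the size of the gap, and that is enough to lose to a projection of 1.00 for everyone.

```python
def weighted_mae(projected, actual, ip):
    """Mean absolute error between projected and actual values, weighted by IP."""
    total_ip = sum(ip)
    return sum(abs(p - a) * w for p, a, w in zip(projected, actual, ip)) / total_ip


# Hypothetical league-scaled ERA for two rookies (1.00 = league average).
ip = [100.0, 100.0]
actual_plus = [1.10, 0.90]

flat = [1.00, 1.00]  # Marcel-style: everyone is league average
bold = [1.30, 0.70]  # right direction for both, but too extreme

flat_error = weighted_mae(flat, actual_plus, ip)
bold_error = weighted_mae(bold, actual_plus, ip)
print(round(flat_error, 2), round(bold_error, 2))
```

To beat the flat projection, a system has to be right about both the direction and the rough size of each rookie's departure from average, which is a lot to ask from minor-league data.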

Unlike for the hitters, however, Marcel's performance really fluctuated across categories, leading ERA and HR/9 but falling to third in BB/9 and last in K/9 by a fairly large margin. That result could be random, or it could be that there's something in the minor league data that is being considered by PECOTA, Steamer, and ZiPS that gives them an advantage over Marcel in predicting walks and strikeouts, but not home runs and ERA.

The next subgroup I looked at is another obvious one: starters. I defined them as pitchers who spent no less than half of their innings as a starter. This is also a good time to bring up an added complication of reviewing the pitcher projections as compared to the hitters. It's well-known that starting is harder than relieving, and so a given pitcher should receive different projections based on whether he's expected to start or not. This is obviously not relevant to Marcel, or at least not in the same way. In an ideal world, I wouldn't include a projection system's ability to correctly guess how a pitcher will be used, but there's no easy way to strip this out. Each of the systems does project how many of a pitcher's games will be starts, but not how many of his innings will come as a starter, so there's no way to prorate his performance and normalize it.

Before showing the results for the starters, I wanted to compare the abilities of the systems to guess whether a player will start or relieve. I treated this as binary, with a system projecting a player as a starter if at least 1/3 of his games came as a starter, and a player actually starting if at least half his innings came as a starter. The average error was calculated the same way, again weighted by IP.

STARTER TESTING   Starter Flag
PECOTA            .077
Steamer           .077
ZiPS              .053

Under those definitions, PECOTA and Steamer guessed correctly whether a player would start or not for 92.3% of the innings in the sample, while ZiPS was at 94.7%. This is reassuring; all three of those figures are high enough that I don't think the average accuracy of the performance projections will be impacted that heavily. The occasional mis-classification of someone like Travis Wood as a starter rather than a reliever shouldn't be a major problem.
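The starter-flag comparison can be sketched like this, as an illustration only. With a 0/1 flag, the IP-weighted mean absolute error is simply the share of innings thrown by pitchers whose projected role didn't match their actual role, which is why an error of .077 corresponds to 92.3% of innings classified correctly. The field names and the three pitchers are hypothetical, and IP is assumed to be a true decimal.

```python
def projected_starter(proj_games, proj_starts):
    """Projected as a starter if at least 1/3 of projected games are starts."""
    return proj_games > 0 and 3 * proj_starts >= proj_games


def actual_starter(ip_total, ip_as_starter):
    """Actually a starter if at least half of his innings came in starts."""
    return ip_total > 0 and 2 * ip_as_starter >= ip_total


def flag_error(pitchers):
    """Share of total innings belonging to pitchers whose role was misclassified."""
    total_ip = sum(p["ip"] for p in pitchers)
    missed_ip = sum(
        p["ip"]
        for p in pitchers
        if projected_starter(p["proj_games"], p["proj_starts"])
        != actual_starter(p["ip"], p["ip_as_starter"])
    )
    return missed_ip / total_ip


# Hypothetical pitchers: a full-time starter, a full-time reliever,
# and a swingman projected to start who mostly relieved.
pitchers = [
    {"ip": 180.0, "ip_as_starter": 180.0, "proj_games": 32, "proj_starts": 32},
    {"ip": 60.0, "ip_as_starter": 0.0, "proj_games": 60, "proj_starts": 0},
    {"ip": 60.0, "ip_as_starter": 10.0, "proj_games": 30, "proj_starts": 15},
]

error = flag_error(pitchers)
print(round(error, 3), round(1 - error, 3))
```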

With that in mind, how did the systems do at projecting starters, who made up 69.0% of the innings of the total sample?

STARTERS   K/9+   BB/9+   HR/9+   ERA+
PECOTA     .132   .181    .249    .191
Steamer    .105   .167    .244    .165
ZiPS       .127   .187    .280    .199
Marcel     .125   .175    .252    .188

Lower numbers indicate lower average error, and therefore greater accuracy.

First, each system is more accurate in each category than it was for the whole sample, which should not be a surprise. Starters are more predictable than relievers, both inherently and because of the greater sample size of innings they provide in a given season. Also interesting is that there's almost no change in the order of the projection systems from the overall sample, with PECOTA and ZiPS swapping third and fourth in ERA accuracy being the only difference. That would seem to indicate that projecting starters and relievers rewards basically the same traits in the systems, as no system is substantially better at one than the other.

Beyond those broad conclusions, Steamer continues to lead each category, often by a large margin, with Marcel frequently just behind. It's getting repetitive, but I still find it impressive that Steamer had such convincing success in 2015 for both batters and pitchers and across statistics, and that Marcel was even with or ahead of all three of these systems virtually every step of the way.

Finally, repeating another step from my analysis of the hitters, I looked at each system's accuracy when it came to projecting excellent pitchers, whom I defined as those who ended the season with 4.0 WAR or more. As with the batters, there's an inherent playing time requirement in that threshold. Here, it serves to eliminate all relievers and many non-durable starters, so keep that in mind when reviewing the results. Overall, innings from these players made up 10.4% of the total.

WAR>4.0   K/9+   BB/9+   HR/9+   ERA+
PECOTA    .112   .134    .167    .133
Steamer   .107   .168    .171    .145
ZiPS      .083   .118    .169    .151
Marcel    .122   .185    .153    .156

Lower numbers indicate lower average error, and therefore greater accuracy.

Finally, a different set of results! The first thing I noticed was ZiPS's excellence when predicting walk and strikeout rates of these excellent players, beating the other systems by very, very large margins in both categories. Also interesting is that Steamer, which had been the most accurate across most subsets and stats, isn't the most accurate in any category here, and is fairly far off from earning the top marks. Instead, it's PECOTA that was most accurate at predicting the overall performance of the great players, as evidenced by the low error rate for ERA.

Marcel came in first when predicting home run rate, but suffered in each of the other categories. I think this makes a lot of sense, as Marcel's main driver is regression. Home run rate varies wildly, and heavy regression of HR/9 is likely appropriate for anyone. For the other stats, though, a pitcher who earned more than 4.0 WAR in 2015 is probably resisting regression more than the average player, which makes Marcel's approach less likely to work for him. Again, this is retrospective: it doesn't show that PECOTA was necessarily the best at identifying who the great pitchers of 2015 would be, just that it was most accurate at projecting the pitchers who ended up being the best.

That concludes this review of projections in 2015. What should be clear by now is that there's enormous room for improvement, as Marcel's simple formula has continued to put up good results since its inception in 2001. There's a website (or baseball team) with a stack of money for the first person(s) who can consistently and soundly beat it. Maybe you're that person!

. . .

Henry Druschel is a Contributor at Beyond the Box Score. You can follow him on Twitter at @henrydruschel.