Projections are a constant in modern baseball analysis. Gone are the days when players were evaluated only on past performance; now, players as well as trades, contracts, and roster constructions are evaluated with some combination of one or more quantitative forecasts. There are many, and they don't all agree (as BtBS's own Ryan Romano has been documenting recently), but they tend to do, on the whole, a much better job than merely going by one's gut.
That brings up two related questions. Just how good a job do the projections do? And how do they fare not on the whole but on more specific subsets of players? Are certain skillsets or profiles easier for a system to project than others? Other writers have taken whacks at these questions in the past (this 2014 article from The Hardball Times evaluated the performance of several systems in 2014, for both MLB as a whole and groups of players sorted by experience and age), and I wanted to do something similar for 2015.
I'll be considering the main public projection systems: PECOTA, Baseball Prospectus' proprietary system; Steamer, created by Jared Cross, Dash Davidson, and Peter Rosenbloom and hosted at FanGraphs and Razzball; and ZiPS, created by Dan Szymborski and also hosted at FanGraphs.
As a baseline, I'll also include the charmingly named Marcel the Monkey Forecasting System, which is overseen by Tom Tango. There's not much to oversee, however; the whole point of the Marcels is to provide an extremely basic forecast. It uses only a simple age adjustment and each player's past three seasons of performance, weighting last year most heavily and three years ago least heavily. Players with no MLB history are, by design, projected at precisely league average. For a project like this, Marcel is invaluable, since it establishes the bar that any other projection system should aim to clear.
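To show how little goes into a Marcel-style forecast, here is a minimal sketch of a rate projection. The 5/4/3 season weights and the 1,200 plate appearances of league-average regression are the figures commonly cited for Marcel's hitter projections; they are assumptions here, not something specified above. The age adjustment is left out entirely, and the function name `marcel_rate` is hypothetical.

```python
def marcel_rate(seasons, league_rate, weights=(5, 4, 3), regress_pa=1200):
    """Project a rate stat from up to three past seasons (no age adjustment).

    seasons: list of (successes, plate_appearances), most recent season first.
    weights and regress_pa are assumed values, not taken from the article.
    """
    weighted_successes = 0.0
    weighted_pa = 0.0
    for (successes, pa), weight in zip(seasons, weights):
        weighted_successes += weight * successes
        weighted_pa += weight * pa
    # Blend in league-average performance. With no MLB history the sums are
    # zero, so the result is the league rate, as described for Marcel.
    return (weighted_successes + regress_pa * league_rate) / (weighted_pa + regress_pa)
```

A player with one season of 200 times on base in 600 plate appearances and a league rate of .300 lands between the two, pulled toward his own record because that season carries the heaviest weight.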
First, I took all batters with at least one plate appearance in 2015 and threw out anyone who didn't receive a projection from all three of PECOTA, Steamer, and ZiPS, leaving an overall sample of 545 players. There are alternatives to this process you might prefer — the most obvious would have been to give each unprojected player a league-average forecast, like Marcel — but I wanted this evaluation to focus on the figures the various systems actually produced, not their scope.
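The sample selection described above amounts to a set intersection. This is a minimal sketch under assumed data shapes (dictionaries keyed by a player ID); the function name `common_sample` and the input layout are hypothetical, not the format any of the systems actually publish.

```python
def common_sample(actual_pa, projections):
    """Return sorted IDs of players who batted and were projected by every system.

    actual_pa: dict mapping player ID -> actual plate appearances.
    projections: dict mapping system name -> dict of player ID -> projection.
    """
    # Start from everyone with at least one actual plate appearance.
    eligible = {pid for pid, pa in actual_pa.items() if pa >= 1}
    # Drop anyone missing from any one system's projections.
    for system_projection in projections.values():
        eligible &= set(system_projection)
    return sorted(eligible)
```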
The next thing I did was take the projected and actual stats for each player and scale them to league average, for two reasons. One, it could be argued that this is what is actually relevant, and that it's more important to know how well a system projected relative levels of performance than whether it correctly projected a continued rise in strikeouts league-wide. Again, this isn't the only way — there's something interesting in the fact that Steamer projects a leaguewide OBP of .316, while ZiPS is at .299 and PECOTA at .300 — but it's what I chose. Two, and more importantly, it allows for comparisons of projections across statistics. I wanted to look for different strengths of the systems in different areas (maybe PECOTA can predict OBP with incredible accuracy, while Steamer's forte is SLG), and scaling everything to league average makes those comparisons very easy.
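One way to do this scaling is to divide each set of numbers (one system's projections, or the actual results) by its own plate-appearance-weighted average, so that 1.0 always means league average. The sketch below assumes that approach and assumes actual plate appearances are used as the weights; the text above does not spell out either detail, and `scale_to_league` is a hypothetical name.

```python
def scale_to_league(values, plate_appearances):
    """Express each player's stat relative to the PA-weighted league average."""
    total_pa = sum(plate_appearances)
    if total_pa <= 0:
        raise ValueError("need at least one plate appearance")
    league_avg = sum(v * pa for v, pa in zip(values, plate_appearances)) / total_pa
    return [v / league_avg for v in values]
```

Under this scheme, a system that projects every hitter's OBP 15 points too high and one that projects the league level correctly end up on the same footing, which is the point of the adjustment.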
Then it's a matter of comparing each player's projected statistics to their actual statistics and combining those across players for each projection system. Formally, it's

$$\frac{\sum_{i=1}^{n} \mathrm{PA}_i \cdot \left| \mathrm{projected}_i - \mathrm{actual}_i \right|}{\sum_{i=1}^{n} \mathrm{PA}_i}$$

for players numbered 1 to n, also known as the weighted mean absolute error. Multiplying each player's figure by their number of plate appearances eliminates the need for a minimum playing time requirement and makes projecting regular players appropriately more impactful than projecting backups.
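The plate-appearance-weighted error described above can be sketched in a few lines. This assumes the weighted errors are divided by total plate appearances, which is what "mean" implies; the function name `weighted_mae` is hypothetical.

```python
def weighted_mae(projected, actual, plate_appearances):
    """PA-weighted mean absolute error between projected and actual values."""
    total_pa = sum(plate_appearances)
    if total_pa <= 0:
        raise ValueError("need at least one plate appearance")
    weighted_error = sum(
        pa * abs(p - a)
        for p, a, pa in zip(projected, actual, plate_appearances)
    )
    return weighted_error / total_pa
```

With two players, a regular (600 PA) missed by 0.1 and a backup (200 PA) missed by 0.2, the result is (600 × 0.1 + 200 × 0.2) / 800 = 0.125, closer to the regular's error than a simple average of 0.15 would be.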
I chose five statistics to consider: BB%, K%, OBP, SLG, and "wOBA". Unlike real wOBA, this version was constructed using just walks, singles, doubles, triples, home runs, and plate appearances, as those were the stats that each system projected, but it still provided a good measurement of overall offensive production. I used the 2015 weights, available at FanGraphs. The five stats taken together cover a wide range of offensive skills — plate discipline, contact, power, and combinations thereof — without being overly specific.
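For concreteness, here is a sketch of a simplified "wOBA" built from only those components. The weights are the 2015 values as I understand FanGraphs to list them, so treat them as an assumption and check them against the published table. Dividing by plate appearances is also an assumption, since real wOBA uses a different denominator. The name `simple_woba` is hypothetical.

```python
# Assumed 2015 linear weights; verify against FanGraphs before relying on them.
WEIGHTS_2015 = {"bb": 0.687, "1b": 0.881, "2b": 1.256, "3b": 1.594, "hr": 2.065}

def simple_woba(bb, singles, doubles, triples, hr, pa, weights=WEIGHTS_2015):
    """Simplified wOBA: weighted walks and hits divided by plate appearances."""
    if pa <= 0:
        raise ValueError("plate appearances must be positive")
    numerator = (
        weights["bb"] * bb
        + weights["1b"] * singles
        + weights["2b"] * doubles
        + weights["3b"] * triples
        + weights["hr"] * hr
    )
    return numerator / pa
```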
Much of this methodology is either inspired by or taken directly from this 2007 thread from Tom Tango's website and a 2007 Nate Silver article at Baseball Prospectus. The original version unfortunately appears to be lost to the depths of the internet, but an archived version can be seen here. With the technical stuff out of the way, how did each system perform?
Lower numbers indicate a lower average error, and therefore greater accuracy.
There's a lot to unpack here, some expected and some very surprising. First, there appears to be a fairly clear order of ease of projection: on-base percentage, followed by "wOBA," slugging percentage, strikeout rate, and walk rate. Some of that is expected, but not all. I expected slugging, a stat that depends almost entirely on power output, to be one of the more difficult categories to predict, since power can seemingly appear and disappear at random, but that's evidently not the case. I also wouldn't have guessed that walk rate would be more than twice as hard to project as on-base percentage, both because I thought walk rate would be fairly easy to project and because I thought it would be closely tied to on-base percentage.
Next, the 2015 editions of ZiPS and PECOTA appear to have fallen short of the baseline set by Marcel. Marcel was more accurate than both systems in each of the five categories, though sometimes by only a small margin. Conversely, Steamer was the most accurate in each of the five categories, usually by a pretty healthy amount. It's important to emphasize that this is based only on one year of data, so it definitely doesn't show that Steamer is the best system, only that Steamer did the best in 2015. I'd also say it shows that projecting baseball is really difficult, and improving on Marcel's simple approach is very tough.
There also seems to be some indication of overlap in the traits of a system that make it project these varied stats well. While they obviously influence each other — slugging and overall offensive output are not independent of strikeout rate, for example — it's still remarkable that the order of accuracy among the systems was almost entirely unchanged across all five stats.
The next thing I did was see which system added the most unique information. I did this by calculating the change in accuracy when using the average of all four systems versus the average with one system left out. If, for example, ZiPS and Marcel have virtually the same opinion on every player, there won't be much difference between using the average of all four and the average of just PECOTA, Steamer, and Marcel. Alternatively, if ZiPS breaks from the consensus on a lot of players, there will be a bigger shift when adding it into the average. This also has the upside of testing the projection accuracy of the simple average. Taking the mean is a common tactic when faced with disagreeing projections, and I'm interested to see how it holds up.
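The leave-one-out comparison can be sketched as follows. The sign convention is an assumption: the change is computed as the error of the full average minus the error of the average without a given system, so a negative figure means including that system lowered the error. The function names are hypothetical.

```python
def weighted_mae(projected, actual, plate_appearances):
    """PA-weighted mean absolute error."""
    total_pa = sum(plate_appearances)
    errors = (pa * abs(p - a) for p, a, pa in zip(projected, actual, plate_appearances))
    return sum(errors) / total_pa

def mean_projection(systems, names):
    """Simple per-player average of the named systems' projections."""
    n_players = len(systems[names[0]])
    return [sum(systems[name][i] for name in names) / len(names) for i in range(n_players)]

def leave_one_out_changes(systems, actual, plate_appearances):
    """Return (error of the full average, {system: full error - error without it})."""
    all_names = list(systems)
    full_error = weighted_mae(mean_projection(systems, all_names), actual, plate_appearances)
    changes = {}
    for left_out in all_names:
        kept = [name for name in all_names if name != left_out]
        reduced_error = weighted_mae(mean_projection(systems, kept), actual, plate_appearances)
        # Negative: the average was more accurate with this system included.
        changes[left_out] = full_error - reduced_error
    return full_error, changes
```

A system that matches the consensus moves the average very little either way, so its change sits near zero regardless of how accurate it is. The size of the change reflects uniqueness, and the sign reflects whether that uniqueness helped.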
[Table: change in accuracy for each statistic when one system is left out of the four-system average: change without PECOTA, without Steamer, without ZiPS, and without Marcel.]
Negative figures indicate added accuracy.
The first thing to note is that the average of all four systems does a pretty good job when compared to any individual one. It's second-best in each of the five categories and sometimes quite close to the top marks set by Steamer. If one system is consistently the best, year-in and year-out, then using that system makes sense, but if the accuracy of each system fluctuates from year to year (as seems likely), using the average seems like a good way to ensure results that are at least solid.
The other results are not too surprising, especially in light of the first table. Marcel and Steamer had the best results, so they added accuracy to each category, while PECOTA and ZiPS reduced it. Marcel, as the most basic system, also brought very little in the way of uniqueness and changed the overall accuracy by the smallest amount in four of the five categories. Steamer was consistently the most unique, especially in projecting strikeouts, and seems to have brought a lot of new and accurate information to the average. Whether by design or by luck, 2015 was a good year for Steamer.
I then looked at the systems' performance at projecting certain groups of players, starting with players without any major league plate appearances. This is both the hardest group to forecast, due to their lack of major-league data, and in many ways the most important to get right, because of the uncertainty around their performance. All 56 of these players received Marcel projections of league-average performance.
The first thing to remember is that these results are based on a much smaller number of plate appearances than the previous tables (8,581 versus 167,505), and so are even less likely to indicate actual trends rather than mere random variation. That said, there are still some interesting nuggets to pull out.
First, while Marcel takes a serious hit, it still does a remarkably good job. Remember, for all these players, the only "methodology" Marcel had was predicting precisely league-average performance, and it still did the best at predicting slugging and "wOBA", though it was the worst at walk and strikeout rate. Let me repeat that: for players with no major-league track record who debuted in 2015, the best projection of their overall offensive value came from the system that gave every member of the group a league-average projection. That's incredible!
PECOTA and ZiPS also do better than they did at predicting the entire population, with PECOTA taking the top spot in walk rate and OBP and ZiPS in a virtual tie for first with Marcel at "wOBA". Unsurprisingly, the projections are less accurate across the board, though the differences in walk rate and OBP are relatively small, especially for the non-Marcel projections.
The next group I isolated was power hitters. League-average isolated slugging (the purest measurement of power output) in 2015 was .150; just over 20% of plate appearances leaguewide went to hitters with an ISO higher than .200, so that's the threshold I chose.
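Picking a threshold this way means checking what share of plate appearances a candidate cutoff captures. The sketch below assumes each player is a dictionary with `pa` and `iso` fields; that layout and the name `select_subgroup` are hypothetical.

```python
def select_subgroup(players, predicate):
    """Return the players matching predicate and their share of all plate appearances."""
    total_pa = sum(player["pa"] for player in players)
    if total_pa <= 0:
        raise ValueError("need at least one plate appearance")
    subgroup = [player for player in players if predicate(player)]
    share = sum(player["pa"] for player in subgroup) / total_pa
    return subgroup, share
```

Calling `select_subgroup(players, lambda p: p["iso"] > 0.200)` returns the power hitters along with their fraction of total plate appearances. The same helper works for the other subgroups by swapping the predicate.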
Despite the reduced plate appearances, the accuracy of the systems generally improved here. This appears to indicate two things: Power hitters are likely a bit easier to project than the overall MLB population, and the decline in accuracy among rookies in the above table is a result of more than just the smaller sample.
The improvement wasn't uniform, however, across systems or statistics. Each system was at least slightly worse at projecting walk rate for the power hitters, and results were mixed when projecting strikeout rate and on-base percentage. The big improvements came at projecting slugging and "wOBA", but only for three of the four systems. PECOTA, ZiPS, and Marcel all improved by about .020 in both categories, a substantial figure, while Steamer's accuracy declined by .022 for slugging and .007 for "wOBA." Again, I'd hesitate to read too much into this, but this is at least a suggestion that power hitters are generally easier to project, especially their power and overall production, and that Steamer has more difficulty with these types of hitters than other systems.
The final group I chose to isolate was really, really good players. There's obviously some question of how to define that; I went with players with a total fWAR of at least 4.0. That obviously has an implicit playing time requirement — it's hard to get to 4 WAR without at least 300 plate appearances — so this group differs from the overall pool in several ways. It's made up of only 39 of the 545 players in the overall pool but 14.7 percent of the total plate appearances.
The first thing I notice in this table is that, as with the other groups, Marcel still does a remarkably good job, hanging close to the other systems in every category and doing quite well in some. If there's one takeaway from this article, it's that baseball is extremely hard to predict, and using Marcel is going to be almost as good as using the more complicated systems in the vast majority of situations.
The performance of the other systems varied a lot across the different categories. Steamer again excelled at K%, where it was more accurate than every other system by roughly .020. However, it fell behind in on-base percentage, slugging, and "wOBA" by similar or greater margins; ZiPS led the field in those categories. Overall, accuracy generally improved, and by large margins in certain categories, perhaps indicating there's something about great players that makes them more consistent or easier to predict.
Curiously, Steamer was an exception to that. In on-base percentage, its accuracy fell by .004 versus an average increase in accuracy of .036 among the other systems; in slugging, by .015 compared to an average increase of .035; and in "wOBA", by .003 versus an increase of .036. Steamer, despite its excellent accuracy overall in 2015, seemed to have some trouble projecting excellent players.
I could keep slicing the sample into different subgroups, but that stops adding new information past a certain point. The broad conclusions are already clear: Steamer did very well in 2015, though it struggled with certain subgroups; PECOTA and ZiPS didn't have great years; and Marcel, despite its incredible simplicity, sets a high bar for any system to clear in every group, even rookies.
This strikes me as a really empowering conclusion. A constant feeling in baseball analytics is that there is nothing you can do that couldn't also be done by dozens of other people, but here's an area where the combined work of several great minds working for more than a decade did not improve much against the baseline hurdle, or at least they didn't in 2015. There's a lot of room for improvement, and that translates into opportunity. Next week, I'll repeat this process for pitching projections, and see if the more advanced systems fare any better in their fight against Marcel.
. . .
Henry Druschel is a Contributor at Beyond the Box Score. You can follow him on Twitter at @henrydruschel.