Prediction is very difficult, especially about the future.
What do a three-time American League MVP and a Nobel-winning physicist have in common? Both have had this quote attributed to them. Yogi Berra and Niels Bohr each supposedly said it, although I'd probably defer to Niels Bohr on the subject of predictions.
Regardless of who originated the quote, whoever said it was decidedly correct. Nothing is more difficult than predicting future events, especially when you don't have good data to go on. Baseball is especially difficult to predict, because it is one of the noisiest sources of data.
Despite this, people still try to project what players will accomplish in the coming season. Around February-March, we see projections for players and teams starting to come out of the woodwork. Some involve complex formulas, while others merely involve general feelings about a player.
Part of the fun over the course of a season is seeing which players exceed their projected levels. Probably nothing is more talked about on television, radio, fantasy chats, and in general conversation than breakout and bust players. A fair question to ask is, "Who has had the best breakout season?" Or, more specifically, who has most exceeded expectations?
Now, an easy way to look at this would be to scale each player's projected full-season WAR to his current number of plate appearances and look at the difference from his actual WAR. But this would be a little short-sighted, because players with different hitting profiles carry different amounts of expected variability in their seasons. And anyway, what fun is the simple answer?
So instead, why don't we look at the probability that a player exceeds his current level, given that his projection is correct? While this can be difficult to compute directly, we can get at it through simulation.
Of course, to start, we need a projection system to work from. For this, I chose Dan Szymborski's ZiPS projections. Now, we need to discuss how to get at the probability estimate we want.
Many times before, I've mentioned using bootstrapping to assess the variability of an estimator. I will again be using a bootstrap-esque procedure, and I want to explain a little about how the bootstrap method works in this case. Consider it "Bootstrapping for Baseball Applications." If you aren't interested, jump to the next header.
Bootstrapping: A Primer
Bootstrapping is a useful method for examining the distribution or properties of an estimator when observing those properties directly is difficult. Assessing variability typically requires multiple observations (i.e., multiple point estimates), yet we often have only one sample. Baseball is an excellent example: we may want to know the distribution of, say, wOBA for a player, but we can only observe a single season of wOBA.
However, our one sample has many data points which should be representative of the population as a whole. In baseball terms, a player's season is made up of individual plate appearances which should be representative of his overall talent level. Why couldn't we create seasons out of this sample?
This is what the bootstrap does. It takes our sample and creates a new dataset by resampling with replacement from it. So in baseball, we can create as many seasons as we want by sampling with replacement from the plate appearances in the observed season. Suddenly, we have a large number of point estimates, allowing us to estimate variances, probabilities, or anything else we want to investigate.
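To make the primer concrete, here's a minimal sketch in Python (the original analysis was done in R). The plate-appearance outcomes and wOBA-style weights below are invented for illustration, not real data:

```python
import random

# Toy example: one observed "season" of plate-appearance outcomes,
# coded as wOBA-style weights (0 = out, 0.7 ~ walk, 0.9 ~ single,
# 1.25 ~ double, 1.6 ~ triple, 2.0 ~ home run). Made-up numbers.
random.seed(42)
season = ([0.0] * 400 + [0.7] * 60 + [0.9] * 90 +
          [1.25] * 30 + [1.6] * 10 + [2.0] * 10)

def woba(pa_outcomes):
    """wOBA-style rate: average weight per plate appearance."""
    return sum(pa_outcomes) / len(pa_outcomes)

# Bootstrap: build many "new seasons" by resampling plate appearances
# with replacement, then look at the spread of the point estimates.
boot_wobas = []
for _ in range(10_000):
    resampled = random.choices(season, k=len(season))
    boot_wobas.append(woba(resampled))

mean = sum(boot_wobas) / len(boot_wobas)
var = sum((w - mean) ** 2 for w in boot_wobas) / (len(boot_wobas) - 1)
print(f"observed wOBA: {woba(season):.3f}")
print(f"bootstrap mean: {mean:.3f}, bootstrap SD: {var ** 0.5:.3f}")
```

The single observed season yields one wOBA, but the 10,000 resampled seasons give us an entire distribution for it, which is exactly what we need for variance or probability estimates.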
So, we'll use this bootstrap-esque technique to look at the probability that a player could exceed his expectations. To start, we assume that the ZiPS projection accurately reflects the player's expected level. We then sample with replacement from the plate appearances in the projection x times, where x is the player's number of plate appearances to date. Next, we calculate the player's WAR by adding his bootstrapped RAA to his actual UZR, replacement, and positional components. Finally, out of 10,000 simulated seasons, we count how many times the player exceeds his current level.
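The procedure for one player might look something like the Python sketch below (again, the author's actual work was in R). Every number here, from the projected per-PA run values to the player's UZR, replacement, and positional components, is invented for illustration rather than taken from ZiPS or Fangraphs:

```python
import random

random.seed(7)

# Hypothetical player inputs -- all made up for illustration.
current_pa = 300                  # plate appearances so far
actual_war = 2.5                  # player's actual WAR to date
uzr_runs, repl_runs, pos_runs = 3.0, 9.0, -2.0  # non-batting components
runs_per_win = 10.0               # rough runs-to-wins conversion

# A projected "season" of per-PA run values (runs above average per PA),
# standing in for the plate appearances implied by the projection.
projected_pa_runs = [random.gauss(0.02, 0.25) for _ in range(600)]

exceeded = 0
n_sims = 10_000
for _ in range(n_sims):
    # Resample the player's current number of PAs, with replacement,
    # from the projected season, and sum to get a bootstrapped RAA.
    boot_raa = sum(random.choices(projected_pa_runs, k=current_pa))
    # WAR = bootstrapped batting RAA + actual UZR/replacement/positional.
    boot_war = (boot_raa + uzr_runs + repl_runs + pos_runs) / runs_per_win
    if boot_war > actual_war:
        exceeded += 1

p_higher = exceeded / n_sims
print(f"P(projection-based season beats current WAR) = {p_higher:.4f}")
```

A small `p_higher` means the player's actual season has far outrun what his projection made plausible; a value near 0.5 means he is performing almost exactly as projected.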
The Most Unexpected Players of 2013
So, of the 253 players who had at least 150 PAs by this past Monday, who exceeded their projection by the largest amount? And what was the probability that we'd see a better season from this player? In the table below, we have the player, their actual WAR, and the probability they exceeded their current level based on preseason projections.
| Player | Actual WAR | P(Higher WAR) |
| --- | --- | --- |
| Alejandro De Aza | 0.8 | 0.5675 |
So, to add to all his accomplishments this year, Chris Davis is the most unexpected player of 2013 so far, and it's not even close. The season he is having is roughly 20 times less likely than any other player's, at least relative to their respective projections. Other players who far exceeded expectations were breakout players Everth Cabrera, Josh Donaldson, and Carlos Gomez.
On the other end of the spectrum fall the season's busts. Some are down there because of injuries, others just because they haven't produced so far. Jeff Keppinger barely edges out Ike Davis for the most underwhelming season, with Danny Espinosa and Matt Kemp close behind.
One final comment on these results: they show just how difficult projection really is. We can assess this by looking at how the probabilities from the table above are distributed.
The better the projection, the closer the probability of a better season should be to 0.5. As we can see, the distribution of these probabilities is instead pretty close to uniform on [0,1]. Just a reminder of how difficult these projections are to get right.
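One way to sketch that uniformity check in Python (with randomly drawn fake probabilities standing in for the 253 real P(Higher WAR) values) is to compare the empirical distribution against Uniform(0,1):

```python
import random

random.seed(3)

# Illustrative stand-in: the real analysis would use the 253 players'
# P(Higher WAR) values; here we just draw fake probabilities.
probs = [random.random() for _ in range(253)]

# A simple uniformity check: the Kolmogorov-Smirnov statistic, the
# largest gap between the empirical CDF and the Uniform(0,1) CDF.
probs.sort()
n = len(probs)
ks = max(max(abs((i + 1) / n - p), abs(p - i / n))
         for i, p in enumerate(probs))
print(f"KS statistic vs. Uniform(0,1): {ks:.3f}")
# Values well under roughly 1.36 / sqrt(n) (about 0.085 for n = 253)
# are consistent with a uniform distribution -- i.e., the projections
# carry little information about who will beat their current level.
```

If projections were highly informative, the probabilities would bunch near 0.5 and the KS statistic against the uniform would be large; a small statistic is exactly the "close to uniform" pattern described above.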
All statistics courtesy of Fangraphs. Statistical work done in R.