Dan Uggla decidedly has the most unlikely hitting streak of any player in history. Prior to the streak, it may have been an understatement to say that Uggla was having a miserable year. The Braves traded for him in the offseason, envisioning him as a large part of the team's power source for several years. The Braves then signed him to the second-highest contract by average annual value for a second baseman in major league history. The year hadn't gone as expected, and on July 4th, 2011, he had a Mendoza-like slash line of .173/.241/.327. However, he then went on a tear in the form of a 33-game hitting streak. During the streak, his slash line was .377/.438/.762, pulling his season totals up to less disappointing, although still less than respectable, levels.
Now, when comparing Uggla's first half of the season to his streak, it seems clear that the chances of Uggla getting a hit increased. However, it is worth asking how much they changed, or even what Uggla's chance of getting a hit was at any single point of the season. Is batting average enough, or is more information needed?
Batting Average and Probability
With apologies to Laplace, the most important questions in baseball are, for the most part, really only problems of probability. Really, this is a plausible statement. If we knew the true probability that a player reaches base, we could have a much better handle on evaluating the talent level of a player.
Say you were given a whole season's worth of data on a batter. Then, you were asked for the probability that the batter gets a hit in a single at bat. What would you answer? A natural answer would be the player's batting average. It has several nice qualities: it's intuitive, it's the most likely value based on the data, and, as long as the true probability stays fixed, it will converge to the truth as the number of at bats goes to infinity, among others.
However, there are inherent problems with just using the player's batting average to date. A slow start to the season, such as Uggla's in 2011, could drag down that value since all at bats are equally weighted. In other words, batting average doesn't take recent performance into account, ignoring the possibility of the proverbial "hot hand."
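To make the convergence claim concrete, here is a minimal Python sketch. It assumes, purely for illustration, that every at bat is an independent trial with the same hit probability; the names `batting_average`, `simulate_batting_average`, and `true_p` are hypothetical and not part of any library.

```python
import random

def batting_average(hits, at_bats):
    """Hits divided by at bats: the most likely value of a fixed hit probability."""
    if at_bats <= 0:
        raise ValueError("at_bats must be positive")
    return hits / at_bats

def simulate_batting_average(true_p, at_bats, seed=0):
    """Simulate independent at bats with hit probability true_p (an assumption)."""
    rng = random.Random(seed)
    hits = sum(rng.random() < true_p for _ in range(at_bats))
    return batting_average(hits, at_bats)

if __name__ == "__main__":
    # The estimate should settle near true_p as the number of at bats grows.
    for n in (50, 500, 5000, 50000):
        print(n, round(simulate_batting_average(0.250, n), 3))
```

Under that fixed-probability assumption the estimate tightens around the true value as at bats pile up. The catch is the assumption itself: if the probability drifts during the season, a single season-long average blends every version of the hitter together.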
Let's go back to Uggla for a moment. Going into his game on July 27th, his hitting streak was at 17 games. His season batting average was at .199, while his streak batting average was .328. Which value was closer to the true probability that Uggla would get a hit in that game, averaged over all possible pitchers? Most likely, it's somewhere in between.
The Concept Behind PresAvg
So, what we're looking for is some sort of weighted average of the batter's recent state and his overall state. We'll call this weighted average Present Average (PresAvg for the rest of the piece). To create the statistic, the recent and overall states of a batter need to be determined. Overall batting average and recent batting average will be involved, but not entirely in the way that might be expected. So, let's start with the batter's overall state.
Batter's Overall State
In the creation of PresAvg, the batter's overall state can be represented by his batting average for the season to date. Yeah, it's that simple. Now on to the not as simple.
Batter's Recent State
Now, the batter's recent state is related to his batting average over the previous 7 games. It is not actually that value, but a weighted average of all 7-game batting averages from the start of the season to now (whenever now is).
I can hear you asking "Why can't we just use the batting average from the previous 7 games straight up?" Well, you can, and you will get a statistic from this, but it tends to be a much more volatile statistic than would be desired. You will see a plot of this later on.
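For reference, the raw 7-game batting average that feeds into the recent state could be computed along these lines. This is only a sketch under two assumptions the text does not spell out: the window pools hits and at bats across the games (rather than averaging per-game averages), and the window ends with the most recently completed game. The function name and the game log below are hypothetical.

```python
def lagged_batting_averages(game_log, window=7):
    """Batting average over the most recent `window` games, after each game.

    game_log: list of (hits, at_bats) tuples, one per game, in date order.
    Entries are None until `window` games exist, or if the window has no at bats.
    """
    averages = []
    for i in range(len(game_log)):
        if i + 1 < window:
            averages.append(None)  # not enough games yet
            continue
        recent = game_log[i + 1 - window : i + 1]
        hits = sum(h for h, _ in recent)
        at_bats = sum(ab for _, ab in recent)
        averages.append(hits / at_bats if at_bats else None)
    return averages

if __name__ == "__main__":
    # Made-up game log for illustration, not Uggla's actual box scores.
    log = [(1, 4), (0, 4), (2, 5), (1, 3), (0, 4), (1, 4), (2, 4), (3, 4)]
    print(lagged_batting_averages(log))
```

Each new game pushes one game out of the window and pulls one in, which is exactly why the raw series jumps around so much.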
There's one other reason for not considering the lagged batting average straight up: Baseball is a very noisy game. By this, I mean that true values of probabilities and variables can be obscured by the high variation often seen within baseball. So how do we deal with this problem?
A side note to readers: I'm about to get a little mathematical and technical for the next couple of paragraphs, so if you want to skip to the next header for less technical stuff followed by the results, feel free.
In statistics, this type of problem can be addressed through the use of Hidden Markov Models (HMM). In this model, the true value of a hidden variable x is related only to its most recent predecessor. Then, a seen variable y exists which is related only to this true value of x. One such example, in math form, could be
x_{t} = φ x_{t-1} + ε_{t}
y_{t} = x_{t} + η_{t}
where ε and η are random variables drawn from specified distributions. In this case, we'll assume that φ is known. In our PresAvg case, the unknown x will be the batter's true recent state, and the y values are the observed lagged batting averages.
So, we want to estimate the value of x_{t} based on all the y values that we've seen up to that point. To do this, we make use of a statistical algorithm called a particle filter. I'll spare you the details and say that it creates a sample of potential true values of x. Then, the average of these potential true values is taken to get the batter's recent state.
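For readers who want to see the idea in code, here is a minimal sketch of a bootstrap particle filter for the two equations above. It is not the implementation behind the results in this piece. The text does not give φ, the noise distributions, or the filter settings, so everything numeric here is an assumption: Gaussian noise for both ε and η, φ = 1 (a random walk), a Gaussian starting sample centered at .250, and 2,000 particles. The function name and all parameter names are hypothetical.

```python
import math
import random

def particle_filter(ys, phi=1.0, sigma_eps=0.02, sigma_eta=0.08,
                    prior_mean=0.250, prior_sd=0.10,
                    n_particles=2000, seed=0):
    """Bootstrap particle filter for x_t = phi*x_{t-1} + eps_t, y_t = x_t + eta_t.

    All defaults are illustrative assumptions, not values from the article.
    Returns one filtered estimate of x_t per observation in ys.
    """
    rng = random.Random(seed)
    # Draw the initial sample of candidate states from an assumed prior.
    particles = [rng.gauss(prior_mean, prior_sd) for _ in range(n_particles)]
    estimates = []
    for t, y in enumerate(ys):
        if t > 0:
            # State equation: move each particle forward one step.
            particles = [phi * x + rng.gauss(0.0, sigma_eps) for x in particles]
        # Observation equation: weight each particle by how well it explains y_t
        # (Gaussian likelihood, up to a constant factor).
        weights = [math.exp(-0.5 * ((y - x) / sigma_eta) ** 2) for x in particles]
        total = sum(weights)
        if total == 0.0:
            # Every weight underflowed; fall back to equal weights.
            weights = [1.0] * n_particles
            total = float(n_particles)
        # The estimate is the weighted average of the candidate states.
        estimates.append(sum(w * x for w, x in zip(weights, particles)) / total)
        # Resample so likely states are duplicated and unlikely ones dropped.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return estimates

if __name__ == "__main__":
    # Made-up lagged averages for illustration only.
    lagged = [0.180, 0.220, 0.150, 0.310, 0.280, 0.360, 0.330, 0.400]
    print([round(x, 3) for x in particle_filter(lagged)])
```

The ratio of `sigma_eps` to `sigma_eta` controls the smoothing: the noisier the observations are assumed to be relative to the state, the less any single 7-game average moves the estimate. The input should contain only the games for which a lagged average exists.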
Weighting Recent and Overall States
So, we have our overall state (OvAvg) and our recent state (RecAvg). Now, we need to determine the weights that will go into the weighted average. I tend to put slightly more weight on recent performance, but this is all changeable based on personal preference. Finally, after some monkeying around with the weights, the formula for PresAvg was defined as
PresAvg = 0.6 * RecAvg + 0.4 * OvAvg
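As a quick sketch, the formula translates directly into code. The function and parameter names are hypothetical. The example plugs in Uggla's .199 season average from July 27th, with his .328 streak average standing in for RecAvg purely for illustration (the actual RecAvg would come out of the particle filter, not the raw streak average).

```python
def pres_avg(rec_avg, ov_avg, recent_weight=0.6):
    """Weighted average of the recent and overall states.

    recent_weight=0.6 matches the formula above; the overall state
    receives the remaining 1 - recent_weight.
    """
    if not 0.0 <= recent_weight <= 1.0:
        raise ValueError("recent_weight must be between 0 and 1")
    return recent_weight * rec_avg + (1.0 - recent_weight) * ov_avg

if __name__ == "__main__":
    # Illustrative inputs only: .328 is a stand-in for the filtered RecAvg.
    print(round(pres_avg(0.328, 0.199), 3))  # 0.276
```

Keeping the weight as a parameter makes it easy to try other splits, since the 0.6/0.4 choice is a matter of preference.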
So What Does PresAvg Really Represent?
I want to take a moment to discuss this question before moving on to applying this statistic to Dan Uggla. Really, most of the explaining here goes into what PresAvg is not.
- It is not an estimate of the probability that a batter gets a hit in any randomly chosen at bat. A better estimator of that would be batting average. The randomly chosen at bat lifted out of context basically breaks apart any sort of time relation with the previous time points, eliminating the need for evaluating the batter's recent state. That just leaves the overall state, which is best estimated by batting average.
- It is not a statistic to rank players by at the end of the season. Unlike most other statistics like batting average, on-base percentage, wOBA, and so on, PresAvg is not a cumulative statistic. At the end of the season, it would just tell you a batter's probability of getting a hit at that time. So a really hot Ike Davis could conceivably have a higher PresAvg than a slumping Miguel Cabrera at the end of the season, although it's still unlikely.
- It is an estimator of the batter's probability of getting a hit at the present time based on their recent state and overall state. So this would be usable at any moment where you would see a batter's numbers over a recent set of games.
Now, many may argue with certain parts of this statistic, specifically the weights in the weighted average and the form of the HMM specified above. To those who dispute them: these choices are open to change, and are even likely to change as I develop this model more fully.
Dan Uggla and PresAvg
So let's look at Uggla's 2011 season and hitting streak through PresAvg. First of all, let's look at what we know can be seen: his overall season batting average by game, and his 7-game lagged batting average.
The season batting average by game is the OvAvg component of PresAvg. Also, we can see that the 7-game lagged batting average is highly variable, hence why we make use of the HMM. So, we run the HMM, and get our estimates of the RecAvg. This is shown by the red line in the graph below, with the 7-game lagged averages in gray.
So, we have our OvAvg and RecAvg portions, so all that's left is to put them into the PresAvg formula. Uggla's PresAvg can be seen in blue below, with important markers of his season and streak notated.
So we can see that once Uggla's streak started, his PresAvg started going up. After the streak ended, it did decrease back to what his season levels had been like. This is what I would want to see in such an estimator.
This can be applied to any player at any time during the season. While it does have some deficiencies (it's not entirely intuitive, and there are potential problems dealing with injuries, among others), it should give a better idea of the probability of a hit than batting average alone.