Dan Uggla, Probability, and PresAvg

Scott Cunningham

Dan Uggla's 33-game hitting streak was not expected. But was his probability of getting a hit during the streak really as low as his batting average indicated?

Dan Uggla decidedly has the most unlikely hitting streak of any player in history. Prior to the streak, it may have been an understatement to say that Uggla was having a miserable year. The Braves traded for him in the offseason, envisioning him as a large part of the team's power source for several years. They then signed him to the second-highest contract by average annual value for a second baseman in major league history. The year hadn't gone as expected, and on July 4th, 2011, he had a Mendoza-like slash line of .173/.241/.327. However, he then went on a tear in the form of a 33-game hitting streak. During the streak, his slash line was .377/.438/.762, pulling his year totals up to less disappointing, although still a little less than respectable, levels.

Now, when comparing Uggla's first half of the season to his streak, it seems clear that his chances of getting a hit increased. But how much did they change? And what was Uggla's chance of getting a hit at any single point of the season? Is batting average enough, or is more information needed?

Batting Average and Probability

With apologies to Laplace, the most important questions in baseball are, for the most part, really only problems of probability. Really, this is a plausible statement. If we knew the true probability that a player reaches base, we could have a much better handle on evaluating the talent level of a player.

Say you were given a whole season's worth of data on a batter. Then, you were asked for the probability that the batter gets a hit in a single at bat. What would you answer? A natural answer would be the player's batting average. It has several nice qualities: it's intuitive, it's the most likely value based on the data, and it will converge to the truth as the number of at bats goes to infinity, among others.

However, there are inherent problems with just using the player's batting average to date. A slow start to the season, such as Uggla's in 2011, can drag down that value since all at bats are equally weighted. Basically, it doesn't take recent performance into account, essentially ignoring the possibility of the proverbial "hot hand."

Let's go back to Uggla for a moment. Going into his game on July 27th, his hitting streak was at 17 games. His season batting average was at .199, while his streak batting average was .328. Which value was closer to the true probability that Uggla would get a hit in that game, averaged over all possible pitchers? Most likely, it's somewhere in between.

The Concept Behind PresAvg

So, what we're looking for is some sort of weighted average of the batter's recent state and his overall state. We'll call this weighted average Present Average (PresAvg for the rest of the piece). To create the statistic, the recent and overall states of a batter need to be determined. Overall batting average and recent batting average will be involved, but not entirely in the way that might be expected. So, let's start with the batter's overall state.

Batter's Overall State

In the creation of PresAvg, the batter's overall state can be represented by his batting average for the season to date. Yeah, it's that simple. Now on to the not as simple.
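For readers who like to see things in code, here is a minimal Python sketch of the overall state as a running season-to-date average. The function name, the (hits, at bats) game-log layout, and the five-game log are all my own made-up illustration, not Uggla's actual data.

```python
from itertools import accumulate


def season_average_by_game(game_log):
    """Return the season-to-date batting average after each game.

    game_log is a list of (hits, at_bats) tuples, one per game, in order.
    Games played before the first official at bat yield None.
    """
    cum_hits = accumulate(h for h, _ in game_log)
    cum_at_bats = accumulate(ab for _, ab in game_log)
    return [h / ab if ab else None for h, ab in zip(cum_hits, cum_at_bats)]


# Hypothetical five-game log (made-up numbers, for illustration only).
example_log = [(1, 4), (0, 3), (2, 5), (0, 4), (1, 4)]
ov_avg_by_game = season_average_by_game(example_log)
print([round(a, 3) for a in ov_avg_by_game])
```

Every at bat counts equally here, which is exactly why a bad April keeps weighing on the number in August.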

Batter's Recent State

Now, the batter's recent state is related to his batting average over the previous 7 games. It is not actually that value, but a weighted average of all 7-game batting averages from the start of the season to now (whenever now is).
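The raw ingredient here, the 7-game lagged batting average, can be sketched as below. I'm assuming it pools hits and at bats over the last 7 games (rather than averaging the single-game averages), and that early in the season it uses however many games exist so far. The function name and the sample log are hypothetical.

```python
def lagged_average_by_game(game_log, window=7):
    """Pooled batting average over the most recent `window` games, per game.

    game_log is a list of (hits, at_bats) tuples in chronological order.
    Before `window` games have been played, all games so far are used
    (an assumption made for this sketch).
    """
    averages = []
    for i in range(len(game_log)):
        recent = game_log[max(0, i - window + 1): i + 1]
        hits = sum(h for h, _ in recent)
        at_bats = sum(ab for _, ab in recent)
        averages.append(hits / at_bats if at_bats else None)
    return averages


# Hypothetical five-game log (made-up numbers), with a 3-game window
# so that the window actually slides within this short example.
sample_log = [(1, 4), (0, 3), (2, 5), (0, 4), (1, 4)]
lagged_3 = lagged_average_by_game(sample_log, window=3)
print([round(a, 3) for a in lagged_3])
```

Because each value rests on only a handful of at bats, one good or bad game can swing it a lot.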

I can hear you asking "Why can't we just use the batting average from the previous 7 games straight up?" Well, you can, and you will get a statistic from this, but it tends to be a much more volatile statistic than would be desired. You will see a plot of this later on.

There's one other reason for not considering the lagged batting average straight up: Baseball is a very noisy game. By this, I mean that true values of probabilities and variables can be obscured by the high variation often seen within baseball. So how do we deal with this problem?

A side note to readers: I'm about to get a little mathematical and technical for the next couple of paragraphs, so if you want to skip to the next header for less technical stuff followed by the results, feel free.

In statistics, this type of problem can be addressed through the use of Hidden Markov Models (HMM). In this model, the true value of a hidden variable x is related only to its most recent predecessor. Then, a seen variable y exists which is related only to this true value of x. One such example, in math form, could be

x_t = φ x_{t-1} + ε_t

y_t = x_t + η_t

where ε and η are random variables from defined distributions. In this case, we'll assume that φ is known. In our PresAvg case, the unknown x will be the batter's true recent state, and the y values are the observed 7-game lagged batting averages.

So, we want to estimate the current value of x based on all the y values we've seen so far. To do this, we make use of a statistical algorithm called a particle filter. I'll spare you the details and say that it creates a sample of potential true values of x. Then, the average of these potential true values is taken to get the batter's recent state.
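For the curious, here is a rough sketch of what a basic (bootstrap) particle filter for the model above could look like in Python. This is not the exact code behind the plots in this piece. The Gaussian noise terms, the values of φ and the two standard deviations, the way the particles are initialized, and the input numbers are all assumptions picked for illustration.

```python
import math
import random


def particle_filter(observations, phi, sigma_eps, sigma_eta,
                    n_particles=2000, seed=0):
    """Bootstrap particle filter for x_t = phi*x_{t-1} + eps_t, y_t = x_t + eta_t.

    Assumes eps_t ~ N(0, sigma_eps^2) and eta_t ~ N(0, sigma_eta^2); the
    Gaussian choice is illustrative. Returns the filtered mean of x_t
    after each observation.
    """
    rng = random.Random(seed)
    # Start the particles around the first observation (an assumption).
    particles = [rng.gauss(observations[0], sigma_eta)
                 for _ in range(n_particles)]
    estimates = []
    for t, y in enumerate(observations):
        if t > 0:
            # Move each particle forward through the state equation.
            particles = [phi * x + rng.gauss(0.0, sigma_eps)
                         for x in particles]
        # Weight each particle by how well it explains the observation.
        weights = [math.exp(-0.5 * ((y - x) / sigma_eta) ** 2)
                   for x in particles]
        total = sum(weights)
        if total == 0.0:
            # Every weight underflowed: fall back to uniform weights.
            weights = [1.0] * n_particles
            total = float(n_particles)
        # The estimate is the weighted average of the particles.
        estimates.append(
            sum(w * x for w, x in zip(weights, particles)) / total)
        # Resample with replacement in proportion to the weights.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return estimates


# Hypothetical lagged averages (made-up numbers, not Uggla's).
lagged = [0.180, 0.210, 0.150, 0.300, 0.340, 0.290, 0.360]
rec_avg = particle_filter(lagged, phi=1.0, sigma_eps=0.02, sigma_eta=0.06)
print([round(r, 3) for r in rec_avg])
```

The filtered values should move in the same direction as the lagged averages but less violently, which is the whole point of running the filter.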

Weighting Recent and Overall States

So, we have our overall state (OvAvg) and our recent state (RecAvg). Now, we need to determine the weights that will go into the weighted average. I tend to put slightly more weight on recent performance, but this is all changeable based on personal preference. Finally, after some monkeying around with the weights, the formula for PresAvg was defined as

PresAvg = 0.6 * RecAvg + 0.4 * OvAvg
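As a quick sketch, the formula is a one-liner in Python. The function and parameter names are my own, and the weight is left adjustable since it comes down to preference. The example plugs in the July 27th figures quoted earlier, with the .328 streak average standing in for RecAvg. That stand-in is an assumption for illustration only, since the real RecAvg comes out of the filter rather than a raw average.

```python
def pres_avg(rec_avg, ov_avg, recent_weight=0.6):
    """Weighted average of the recent state and the overall state."""
    if not 0.0 <= recent_weight <= 1.0:
        raise ValueError("recent_weight must be between 0 and 1")
    return recent_weight * rec_avg + (1.0 - recent_weight) * ov_avg


# Illustration only: .328 is the streak average used as a stand-in for
# RecAvg, and .199 is the season average going into July 27th.
example = pres_avg(0.328, 0.199)
print(round(example, 3))
```

With those inputs the result lands at about .276, somewhere in between the two averages, as you'd expect from a weighted average.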

So What Does PresAvg Really Represent?

I want to take a moment to discuss this question before moving on to applying this statistic to Dan Uggla. Really, most of the explaining here goes into what PresAvg is not.

  • It is not an estimate of the probability that a batter gets a hit in any randomly chosen at bat. A better estimator of that would be batting average. The randomly chosen at bat lifted out of context basically breaks apart any sort of time relation with the previous time points, eliminating the need for evaluating the batter's recent state. That just leaves the overall state, which is best estimated by batting average.
  • It is not a statistic to rank players by at the end of the season. Unlike most other statistics like batting average, on-base percentage, wOBA, and so on, PresAvg is not a cumulative statistic. At the end of the season, it would just tell you a batter's probability of getting a hit at that time. So a really hot Ike Davis could conceivably have a higher PresAvg than a slumping Miguel Cabrera at the end of the season, although it's still unlikely.
  • It is an estimator of the batter's probability of getting a hit at the present time based on their recent state and overall state. So this would be usable at any moment where you would see a batter's numbers over a recent set of games.

Now, there may be many who argue with certain parts of this statistic, specifically the weights in the weighted average and the form of the HMM specified above. To those who do dispute them: please note that these are open to change, and are even likely to change as I more fully develop this model.

Dan Uggla and PresAvg

So let's look at Uggla's 2011 season and hitting streak through PresAvg. First of all, let's look at what we know can be seen: his overall season batting average by game, and his 7-game lagged batting average.

[Figure: Uggla's 2011 season batting average by game (OvAvg) and his 7-game lagged batting average]

The season batting average by game is the OvAvg component of PresAvg. Also, we can see that the 7-game lagged batting average is highly variable, hence why we make use of the HMM. So, we run the HMM, and get our estimates of the RecAvg. This is shown by the red line in the graph below, with the 7-game lagged averages in gray.

[Figure: Uggla's estimated RecAvg in red, with the 7-game lagged averages in gray]

So, we have our OvAvg and RecAvg portions, so all that's left is to put them into the PresAvg formula. Uggla's PresAvg can be seen in blue below, with important markers of his season and streak notated.

[Figure: Uggla's PresAvg in blue, with important markers of his season and streak]

So we can see that once Uggla's streak started, his PresAvg started going up. After the streak ended, it did decrease back to what his season levels had been like. This is what I would want to see in such an estimator.

This can be applied to any player at any time during the season. While it does have some deficiencies (it's not entirely intuitive, and there are potential problems dealing with injuries, among others), it should give a better idea than batting average alone of the probability of a hit.
