Downloadable Run Distribution Fun!

We usually talk about baseball statistics in terms of gross averages.  But statistics deals with probabilities, not clairvoyance, and so it is always important to think of how values are spread across time or skill level.  This sort of thinking is prevalent in Baseball Prospectus' PECOTA projections and the many iterations of DIPS theory.

The sort of distribution in which I am most interested is run distribution.  I used run distributions to analyze the 2005 AL West race (Part One and Part Two) late last year.  My conclusion was that the way a team distributes its runs scored and allowed can have consequences on its wins and losses in a way that manifests itself as deviations from projected Pythagorean wins.  

I realize that my earlier article reflected my extreme West Coast Media Bias, so I thought it would be fun if everybody could see the run distributions for their favorite teams.  Thankfully, a combination of Retrosheet, Excel, and MATLAB makes generating these plots quite easy.  So I've generated run distribution plots for both runs scored and runs allowed for all major league teams from 1998 - the latest expansion - to 2004. (Whither 2005?  I have to wait until Retrosheet has its 2005 game logs up.)

But that's not all!  On each run distribution plot, I have also included the Weibull distribution curve that theoretically describes run distribution.  For those of you unfamiliar, the Weibull curve shows the expected distribution of run scoring and run prevention.  Deviations from the curve can manifest themselves as deviations from projected Pythagorean wins.  Before I present the run distribution plots, I thought it would be helpful if we go over some Weibull/Pythagorean basics, but feel free to skip over the mathematical junk if you wish.

What the hell is a Weibull Distribution and why should I care?
Steven Miller of Brown University has shown that a three-parameter Weibull distribution describes the run distribution of teams quite well:

f(x; α, β, γ) = (γ/α) · ((x − β)/α)^(γ − 1) · exp(−((x − β)/α)^γ) for x ≥ β, and f = 0 otherwise.

In English, the frequency f with which a team scores (or allows) x runs is equal to a long messy equation with three parameters, α, β, and γ.  The real magic of the Weibull distribution is that it can be used to derive the Pythagorean theorem - and the parameter γ is the same as the exponent in the Pythagorean theorem.
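
For the programmers in the audience, here is a minimal Python sketch of the two formulas in play.  It assumes the standard three-parameter Weibull density and the usual Pythagorean form RS^γ / (RS^γ + RA^γ).  The function names and the sample inputs are made up for illustration.

```python
import math

def weibull_pdf(x, alpha, beta, gamma):
    """Three-parameter Weibull density (scale alpha, shift beta, shape gamma)."""
    if x <= beta:
        # The density is zero below the shift; at x == beta it is also
        # zero whenever gamma > 1, which is the case for baseball exponents.
        return 0.0
    z = (x - beta) / alpha
    return (gamma / alpha) * z ** (gamma - 1) * math.exp(-(z ** gamma))

def pythagorean_pct(runs_scored, runs_allowed, gamma):
    """Pythagorean winning percentage with exponent gamma."""
    rs = runs_scored ** gamma
    ra = runs_allowed ** gamma
    return rs / (rs + ra)

# Hypothetical inputs, chosen only to exercise the functions.
density_at_4 = weibull_pdf(4, alpha=5.5, beta=-0.5, gamma=1.8)
even_team = pythagorean_pct(750, 750, gamma=1.8)  # equal runs scored and allowed
```

A team that scores exactly as many runs as it allows comes out at .500 no matter what γ is, which is a handy sanity check.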

Where did you get the parameters for the Weibull curve, smartypants?
I used the following parameters to generate the Weibull curves:

β = -0.5. This is a mathematical trick that Professor Miller's paper discusses in some detail.  I won't rehash it here, but you can check the original paper if you are interested.

γ = (Runs/Game)^.287.  This is the Smyth-Patriot model, and this parameter is calculated for the entire major leagues for each year.  It is probably more correct to calculate γ separately for each team, but I think the gains are marginal.
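
As a quick illustration of the exponent formula, here is a small Python sketch.  The function name is made up, and the input of 9.6 runs per game is a hypothetical figure that assumes Runs/Game counts the runs of both teams in a game.

```python
def pythagenpat_exponent(runs_per_game):
    """Exponent gamma = (runs per game) ** 0.287."""
    return runs_per_game ** 0.287

# Hypothetical league scoring level: 9.6 runs per game, both teams combined
# (an assumption about how Runs/Game is counted).
league_gamma = pythagenpat_exponent(9.6)
```

The exponent rises slowly with the scoring level, so small year-to-year changes in league offense barely move γ.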

α is computed so that the observed average runs scored (or allowed) is equal to the Weibull-determined average.  By taking the first moment of the Weibull distribution, the average μ can be computed as

μ = α · Γ(1 + 1/γ) + β

where Γ is the well-known gamma function.  Thus

α = (μ − β) / Γ(1 + 1/γ)

and it is calculated separately for both runs scored and runs allowed.  This is not as robust as minimizing the mean-square error, but it sure is quicker.
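
Here is a minimal Python sketch of this moment-matching step, assuming the standard three-parameter Weibull mean.  It also turns the fitted curve into an expected frequency for each whole number of runs by integrating over the bin [k − 0.5, k + 0.5), which is one reasonable reading of why β = −0.5 is convenient.  The function names and the 4.8 runs per game input are hypothetical.

```python
import math

BETA = -0.5  # shift used throughout this article

def alpha_from_mean(mean_runs, gamma, beta=BETA):
    """Solve mean = alpha * Gamma(1 + 1/gamma) + beta for alpha."""
    return (mean_runs - beta) / math.gamma(1 + 1 / gamma)

def weibull_cdf(x, alpha, gamma, beta=BETA):
    """Probability that the Weibull variable is at most x."""
    if x <= beta:
        return 0.0
    return 1.0 - math.exp(-(((x - beta) / alpha) ** gamma))

def expected_frequency(k, alpha, gamma, beta=BETA):
    """Weibull probability assigned to exactly k runs (bin k - 0.5 to k + 0.5)."""
    return weibull_cdf(k + 0.5, alpha, gamma, beta) - weibull_cdf(k - 0.5, alpha, gamma, beta)

# Hypothetical team: 4.8 runs per game, league exponent 1.8.
gamma = 1.8
alpha = alpha_from_mean(4.8, gamma)
curve = [expected_frequency(k, alpha, gamma) for k in range(0, 21)]
```

Because only the mean is matched, the curve can miss the observed spread; a least-squares fit would trade speed for a closer match.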

Why didn't you separate National and American Leagues when calculating γ?
I didn't see a need to.  If you can come up with a convincing explanation as to why I should, please let me know.

How do I read the plots?
Here's a sample:

Each plot shows one team's distributions for runs scored (top) and runs allowed (bottom).  The x-axis is runs and the y-axis is frequency.  The open circles represent actual data and the line represents the Weibull curve.  Each season comes in its own zipped file, which you may download.  Each file includes distributions for all 30 teams as well as a league-wide distribution named Dist_ML_XX.bmp, where XX is the last two digits of the year.
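
If you want to build the open circles yourself, the tally is simple.  This Python sketch assumes you already have a list of one team's runs in each game; the function name and the four-game sample are hypothetical, not real Retrosheet data.

```python
from collections import Counter

def run_frequencies(runs_by_game):
    """Fraction of games in which the team scored exactly k runs, for k = 0..max."""
    counts = Counter(runs_by_game)
    games = len(runs_by_game)
    return {k: counts[k] / games for k in range(max(runs_by_game) + 1)}

# Hypothetical four-game sample, for illustration only.
sample = run_frequencies([0, 2, 2, 5])
```

Run totals that never occurred show up with a frequency of zero, so the points line up with every whole number on the x-axis.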

What does it all mean?
I don't know.  Maybe nothing.  But it is known that deviations from the curve can impact a team's record differently depending on where along the curve they occur.  For example, a team that scores 7+ runs more often than Weibull predicts, but scores 2-5 runs less often, isn't doing itself any favors, as there is a decreasing marginal utility to additional runs.  I discussed some consequences in my above-linked 2005 AL West article; let me know if you think of some more.
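
To put the decreasing marginal utility in concrete terms, here is a rough Python sketch.  It assumes the opponent's scoring follows the Weibull curve, that each run total k owns the bin [k − 0.5, k + 0.5), and that games tied after nine innings are a coin flip.  All names and inputs are hypothetical.

```python
import math

BETA = -0.5  # shift, so each integer run total k owns the bin [k - 0.5, k + 0.5)

def weibull_cdf(x, alpha, gamma, beta=BETA):
    if x <= beta:
        return 0.0
    return 1.0 - math.exp(-(((x - beta) / alpha) ** gamma))

def win_prob_when_scoring(k, opp_alpha, gamma):
    """Chance of winning with exactly k runs against a Weibull opponent.

    Assumption: games tied after nine innings are split 50/50.
    """
    opp_scores_fewer = weibull_cdf(k - 0.5, opp_alpha, gamma)
    opp_scores_same = weibull_cdf(k + 0.5, opp_alpha, gamma) - opp_scores_fewer
    return opp_scores_fewer + 0.5 * opp_scores_same

# Hypothetical opponent: 4.8 runs per game, exponent 1.8.
gamma = 1.8
opp_alpha = (4.8 - BETA) / math.gamma(1 + 1 / gamma)

def gain_from_extra_run(k):
    """Win probability added by scoring k + 1 runs instead of k."""
    return (win_prob_when_scoring(k + 1, opp_alpha, gamma)
            - win_prob_when_scoring(k, opp_alpha, gamma))

gain_3rd_run = gain_from_extra_run(2)  # going from 2 runs to 3
gain_8th_run = gain_from_extra_run(7)  # going from 7 runs to 8
```

Under these assumptions the third run buys more win probability than the eighth, which is why piling up blowouts at the expense of 2-5 run games hurts.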

Is this available numerically?
Yes.  There are links below to Run Distribution Reports for each year that show the league exponent (γ, displayed in the reports as "g"), aggregate win frequency by runs scored, and Major League averages for α (displayed as "a_RS_league").  For each team, the actual run distribution and Weibull distribution are shown, as well as α for runs scored and allowed (shown as "a_RS" and "a_RA," respectively).

You're my favorite BtB author, and you are also very attractive.  Where have you been?
You'll thank me when you see all the good that polymer brushes grafted to semiconductors does for mankind.

Who's that funny-looking dude?
Welcome to Beyond the Boxscore.

What's that noise?
Welcome to Beyond the Boxscore.

Where's that awful smell coming from?
Welcome to Beyond the Boxscore.

Are you going to let me see the damn things?
Sheesh.  Pushy, pushy.  Here they are:

Run Distribution Report 1998 (best viewed in your browser)
Run Distribution Plots 1998 (right-click to save zipped file)

Run Distribution Report 1999
Run Distribution Plots 1999

Run Distribution Report 2000
Run Distribution Plots 2000

Run Distribution Report 2001
Run Distribution Plots 2001

Run Distribution Report 2002
Run Distribution Plots 2002

Run Distribution Report 2003
Run Distribution Plots 2003

Run Distribution Report 2004
Run Distribution Plots 2004

Comments on the plots, both in substance and style, are welcome, as are suggestions for how to get rid of this damn athlete's foot.

Update [2006-2-23 20:26:1 by salb918]: All the plots are in .bmp format, so they should be viewable by just about anybody.