Park factors are really just the worst.
I don’t mean in principle; the concept of adjusting player stats, either retrospectively for the sake of comparison or prospectively for the sake of projection, to account for the ballpark in which they play is a good idea. The problem lies in the math behind them. I’ve spent the better part of my free time for the last two or three weeks reading everything I could find written about park factors and adjustments, all the way from a post at Patriot’s old Tripod site to an article in an academic journal, and I’ve come to a few conclusions. First, it’s impossible to get park factors exactly correct. Second, it’s important to consider the tradeoffs you have to make if and when you decide to use a particular version. Third, there might be something wrong with me for spending so much time on this.
The basic idea behind calculating a park factor is easy enough, right? Usually in the context of runs per game (though sometimes by components like singles, HRs, etc.), it's a team's production at home over its production away. Boom. Done.
Well, no, not really. That’s affected pretty heavily by the skill level of the team. Better include the opponents, too. So, for a given team, it’s that team’s production at home PLUS its opponents' production in those games divided by that team’s production on the road PLUS its opponents' productions in *those* games. There. That makes sense. Done.
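That two-step version - team plus opponents at home, over team plus opponents on the road - can be sketched in a few lines. The numbers below are invented runs-per-game totals, purely for illustration:

```python
# Sketch of the "basic" park factor described above: home production by both
# the team and its opponents, divided by the same on the road.
def basic_park_factor(team_home, opp_home, team_away, opp_away):
    """All inputs are production rates (e.g. runs per game)."""
    return (team_home + opp_home) / (team_away + opp_away)

# Hypothetical totals for one team's season:
pf = basic_park_factor(team_home=4.8, opp_home=4.6,
                       team_away=4.2, opp_away=4.4)
print(round(pf, 3))  # 9.4 / 8.6 ≈ 1.093, a mild hitters' park
```

A factor above 1 means the park inflates production; below 1, it suppresses it.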
Well, no, wait. What about interleague games? Adding/removing the DH from lineups will change the hitting skill level. We should probably exclude those. And what about pitchers batting? Given their lack of skill, will they be affected by a park the same way? Maybe we should leave their PAs out. And what about right- versus left-handers? It’s not like parks are symmetric across a line in center field, that probably makes a difference. And what about ground ball hitters versus fly ball hitters? They’ll be affected differently, too… but also, maybe the GB/FB ratio is affected by parks, too… You see my point, hopefully. This gets very complicated, very quickly, and all of the above is focused mostly on sample issues, not even really getting into problems with the formulas people use.
Still, for whatever foolish reasons drive any of us to dive headfirst into inconsequential things, I've decided on a method that I think makes sense, and will report on that here (and walk you through the creation of park factors for isolated power (ISO), which is what spurred this whole thing). As any thorough researcher would, I relied heavily on the work of those who came before me to teach me how to do this, and to give me a starting point from which to branch out. If you want to read more than I discuss here about the theories and calculations behind all this stuff, please see the following sites:
Patriot’s Park Factors
Baseball Reference Park Adjustments
FanGraphs Library - Park Factors
Park Factor Thoughts by TangoTiger
High Boskage - Baseball Data Normalization
Park Effects by Jim Furtado
The Philosophy of Park Factors by Colin Wyers
Okay. Bearing all that in mind, and probably also some resources I forgot to mention, here's what I did to create park factors for isolated power. There's a LOT of methodological detail ahead, which I think some of you might want to see, but if you don't, I respect that - just skip to the results.
Using MySQL to query the Events table of my Retrosheet database (complete years only, so 1974-2013), I created a spreadsheet of year, home team, away team, batting team, league, at bats, handedness, and ISO. Using that information, for each home team I found the ISO (separated by batter handedness) of that team and that team's opponents. To each of these I applied a regression term, which I'll explain in the next paragraph. After making an adjustment to the opponents' number (that will be described later) I combined the two, proportionally weighting the opponents figure by how many opponents there were - so, say for Atlanta in 1974, the number is 1/12th Atlanta ISO, 11/12ths opponent ISO.
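The proportional weighting at the end of that step looks like this in code - the ISO values here are made up, but the 1/12 vs. 11/12 split matches the 1974 Atlanta example (a 12-team league, so 11 opponents):

```python
# Combine the (regressed) home-team ISO with the (regressed, adjusted)
# opponents' ISO, weighting the opponents' figure by how many opponents
# there were. Illustrative numbers, not real data.
def combine_home_iso(team_iso, opp_iso, n_opponents):
    n = n_opponents + 1  # the team itself plus its opponents
    return (1 / n) * team_iso + (n_opponents / n) * opp_iso

# e.g. a 12-team league: 1/12 team ISO, 11/12 opponents' ISO
combined = combine_home_iso(team_iso=0.130, opp_iso=0.118, n_opponents=11)
print(round(combined, 4))  # 0.119
```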
Regression, both in the above procedure and anywhere else I mention it, was based on the idea of the reliability of statistics as measured by Cronbach's alpha, which I was introduced to via Russell Carleton. He helped me out a bit when I was trying to figure it out, which I'm very grateful for. His articles on the subject can explain it much better than I ever could, so I direct you that way if you'd like to know more about it. I measured alpha separately for home and away ISO, using each season as an individual test subject while excluding the two strike years in the sample (as well as interleague and pitcher ABs). I truncated the seasonal data where necessary in order to make all years' data lines equal length, and used the 'psy' package in R to actually calculate the alpha. For home teams, it came out to 0.668 in 1092 at bats for righties and 0.642 in 721 at bats for lefties; for away teams, it was 0.486 in 696 at bats and 0.467 in 405 at bats, respectively.
Since Cronbach's alpha is effectively a split-half correlation coefficient, I was able to use the value I found to determine how much regression should be included in my calculations based on the formula R = AB/(AB + X), where R is the alpha I found, AB is the number of at bats (per season) corresponding to the alpha, and X is the amount of at bats to use in regression. Note that AB is half of the actual number of at-bats used because of the split-half nature of Cronbach's alpha; I could have used the Spearman-Brown prophecy formula to get a predicted alpha for the entire set of at bats, but the math works out identically either way. Bottom line, for home right handers 271 ABs of league average ISO was added, for home lefties 200 ABs, for away righties 368 ABs, and for away lefties 231 ABs.
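Solving R = AB/(AB + X) for X gives X = AB(1 − R)/R, and plugging in the alphas above reproduces the regression amounts in the text (to within rounding):

```python
# Amount of league-average ABs to add for regression, from Cronbach's alpha.
# AB in the formula is HALF the actual at-bat total, per the split-half
# nature of the alpha; solving R = AB/(AB + X) gives X = AB*(1 - R)/R.
def regression_abs(alpha, total_abs):
    half = total_abs / 2
    return half * (1 - alpha) / alpha

print(round(regression_abs(0.668, 1092)))  # home RHB: 271
print(round(regression_abs(0.642, 721)))   # home LHB: ~201 (200 in the text)
print(round(regression_abs(0.486, 696)))   # away RHB: 368
print(round(regression_abs(0.467, 405)))   # away LHB: 231
```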
The adjustment to opponents' ISO I mentioned attempts to account for a sample difference across the different home teams. Weighting by quantity of opponents means the batters contributing to the measured ISO will be distributed close to evenly across all league teams (not exactly evenly, because of the unbalanced schedule); the pitchers' contributions, however, will then come disproportionately from the home team's pitchers. To fix this, I multiplied the opponents' ISO term by the league-average ISO allowed divided by the team's pitchers' regressed ISO allowed. There might be better ways to solve this, and I'd love to hear them if there are, but this is what I went with for the results below.
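That multiplier is a one-liner; the values below are invented, just to show the direction of the correction:

```python
# Scale the opponents' ISO by league-average ISO allowed over the home team's
# (regressed) ISO allowed, so that a home pitching staff that suppresses power
# doesn't artificially drag down the opponents' sample (and vice versa).
def adjust_opponent_iso(opp_iso, lg_iso_allowed, team_iso_allowed_regressed):
    return opp_iso * (lg_iso_allowed / team_iso_allowed_regressed)

# A staff that allows less power than average nudges the opponents' figure up:
adjusted = adjust_opponent_iso(0.115, lg_iso_allowed=0.125,
                               team_iso_allowed_regressed=0.118)
print(round(adjusted, 4))
```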
That just about wraps up the home team term; now, on to the denominator of the equation. While many if not most park factors compare home production to away production, unless you have a very specific and more obscure goal for your park factors, this isn’t the correct way to do things. If a park factor is meant to remove any park effects and place a player in a theoretical league-average context, the point of comparison needs to be league average production, not away production. Now, the closer a park factor is to neutral the less this matters, since the distance of the road production from league average production must be 1/n the distance of the home production from average (because park factors must average out to neutral). If it were difficult to get the league average version, you could justify using away figures instead, but since it's very much *not* difficult, I used the league average. No further adjustments were needed; since regression is towards league average, none was included here, and using league average eliminated any over-representation from a single team in the sample.
All of that gives you a raw park factor number. In theory, if you do this for all teams in a given year, they should average to 1. I found that this generally doesn't happen; I assume it's due to the regression and adjustments, but I can't say for sure. As the last step in the process I artificially and linearly adjust each factor to force the average in each league to be 1. The final equation comes out to the following (which looks even worse in Excel, trust me), where TOI is team of interest, OPP is that team’s opponents, and POI is park of interest:
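The equation itself appears to have been an image in the original post and didn't survive; reconstructing it from the steps described above (every ISO term regressed as discussed, n the number of teams in the league, and the whole thing then rescaled so each league's factors average to 1), it would look something like:

```latex
PF_{POI} = \frac{\dfrac{1}{n}\,\mathrm{ISO}^{home}_{TOI}
  \;+\; \dfrac{n-1}{n}\,\mathrm{ISO}^{home}_{OPP}
      \cdot \dfrac{\mathrm{lgISO}^{allowed}}{\mathrm{ISO}^{allowed}_{TOI}}}
  {\mathrm{lgISO}}
```

This is a best-effort reconstruction, not the author's exact notation.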
At this linked spreadsheet, you can find single-year, three-year average, and five-year average park factors, both halved and unhalved, split by handedness for all teams and years since 1974. The averages are "surrounding"-year averages; that is to say, the year in question is the central point of the time period being averaged. Averages are interrupted by teams moving to new parks, but not by any configuration changes to existing parks.
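The "surrounding-year" averaging works like a centered moving window; here's a small sketch with hypothetical single-year factors for one park (missing years, e.g. before a park opened, are simply skipped):

```python
# Centered ("surrounding-year") average: the year in question is the middle
# of the window. half_window=1 gives a 3-year average, half_window=2 a 5-year.
def centered_average(factors, year, half_window):
    years = range(year - half_window, year + half_window + 1)
    vals = [factors[y] for y in years if y in factors]
    return sum(vals) / len(vals)

# Invented single-year factors for one park:
pf = {2010: 1.04, 2011: 0.98, 2012: 1.06, 2013: 1.01}
print(round(centered_average(pf, 2012, 1), 4))  # (0.98 + 1.06 + 1.01) / 3
```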
I personally find the most value in the single-year numbers, but there are good arguments to be made for using averages. Single-year factors certainly appear to be noisier, but this is to be expected, and it's closer to being a feature than a bug. Part of that noise is due to a park "feature" that absolutely has an impact on the game, but gets lost if averaged factors are used: weather. Over the long term, the *climate* of a given city will be relatively stable, with changes happening over the course of many years; the *weather*, however, is much more variable season-to-season, and has a huge impact on batted balls, pitch movement, etc. Any park factor that's going to be applied to past data should account for that; hence, a single-year factor is best. Further, since the baseline is league average (and is hence affected by changes, in weather or anything else, in all league parks), it's to be expected that yearly numbers vary a bit.
Multi-year numbers definitely have their place as well, though. Anything forward-looking - say, a projection system - that wants to account for park effects would be better served by using multi-year park factors to estimate the adjustment. I didn't have time to get the data on that, but it can be inferred from the following graphs, which show single-year, three-year, and five-year average park factors for Wrigley Field.
Throughout the above, ISO was my example; this is because wanting to create ISO+ (that is, league- and park-adjusted ISO) drove me to all of this in the first place. Not wanting to leave that idea hanging, below you can find both ISO and ISO+ for qualified batters in 2013. A quick glance through the data shows that the Pirates are helped out a lot, in terms of overall rank, by this method, with Andrew McCutchen and Neil Walker each jumping 17 spots. The Blue Jays are hurt (again, by ranking) a bit, with Jose Bautista and Adam Lind falling 7 and 8 spots, respectively. This is the most superficial of analyses, but maybe someone can find something more interesting.
| Player | Team | League | ISO | ISO+ |
| --- | --- | --- | --- | --- |
| Alfonso Soriano | - - - | - - - | 0.235 | 160 |
| Marlon Byrd | - - - | NL | 0.220 | 150 |
| Mark Reynolds | - - - | NL | 0.172 | 115 |
| Alex Rios | - - - | AL | 0.154 | 105 |
| Justin Morneau | - - - | - - - | 0.151 | 100 |
| Alejandro De Aza | CHA | AL | 0.142 | 92 |
| Michael Young | - - - | NL | 0.116 | 78 |
| Eric Young | - - - | NL | 0.087 | 60 |
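The post doesn't spell out the ISO+ arithmetic, but a natural construction (an assumption on my part, by analogy with stats like OPS+) is park-adjusted ISO over league ISO, scaled so 100 is average. Note that since a hitter plays only about half his games at home, the spreadsheet's "halved" factors are the ones you'd actually apply; this sketch just applies one blended factor to invented numbers:

```python
# Hypothetical ISO+ construction: divide out the park, compare to league,
# scale to 100. NOT confirmed as the author's exact formula.
def iso_plus(iso, park_factor, league_iso):
    return 100 * (iso / park_factor) / league_iso

# Invented example: a .200 ISO in a power-friendly park (PF 1.05),
# in a league with a .140 ISO:
print(round(iso_plus(0.200, 1.05, 0.140)))  # 136
```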
Anyway, I hope someone out there found this all useful. I’d love any feedback or questions you might have, since I’m planning on doing this same process to establish (better) park factors for a bunch of different stats as prep work for a series of cross-era comparison articles coming somewhere down the line. For example, I thought about trying to account for schedule imbalance when I weighted opponents' ISO in the numerator, but it was difficult to accomplish in my spreadsheet and I guessed that the increase in accuracy wasn't worth the effort. If there's anything you notice, please let me know.
. . .
The information used here was obtained free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at www.retrosheet.org. Some other statistics courtesy of FanGraphs and Baseball-Reference.
John Choiniere is a researcher and featured (occasional) writer at Beyond the Box Score. You can follow him on Twitter at @johnchoiniere.