And what that actually means...
by Michael Bradburn
Park factors, in basic terms, are simply a way of quantifying the differences between each ballpark and, despite what Cal Ripken tells you, they exist. This is relatively unique (at least in the four major North American sports) to baseball. No cathedral is built the same and very few are even symmetrical. On the micro-level, this is one reason why a left-handed power bat like Brian McCann was targeted by the Yankees. But on the macro-level, park factors should be able to tell us a lot more. And the perfect equation to calculate park factors unlocks amazing possibilities.
That was a really exciting day in southern Ontario. As far as sports go, the home city hasn't had a lot to cheer for. So when our general manager traded prospects away to the Marlins for a package including Buehrle and Reyes, it was, as understatements go, noteworthy.
That was then. This is now. And, for some reason, Josh Johnson didn't work out. I began thinking to myself: with all the pitching metrics we have, why didn't anybody predict this? How did we, diligent couch-analysts, not catch this before it happened?
Short answer: we’d be millionaires. But you’re here for the long answer.
So I started charting park factors, hoping that the more I familiarized myself with them, the closer I’d get to creating the perfect equation and preventing the next Josh Johnson. Or at least helping prevent a team from overvaluing a player in a different environment.
I think I got pretty good at it. The goal at first was to create every ballpark's ERA. That is, the earned run average of the entire ballpark. This is simple in synthesis: take every run scored in the park, including unearned runs, by both the home and visiting teams, and divide that by the games played. To be precise, "games played" here means the innings pitched by both teams divided by 9. Simple.
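As a rough sketch of that calculation, the function below adds up the runs from both teams and divides by the combined innings pitched over nine. The function name, argument names and sample totals are made up for illustration, and innings are assumed to be plain decimals (a third of an inning is 0.333, not the box-score ".1").

```python
def ballpark_era(home_runs, visitor_runs,
                 home_innings_pitched, visitor_innings_pitched):
    """Runs per nine innings pitched at one park, counting both teams.

    Unearned runs are included, as described in the text.
    """
    total_runs = home_runs + visitor_runs
    total_innings = home_innings_pitched + visitor_innings_pitched
    if total_innings <= 0:
        raise ValueError("innings pitched must be positive")
    # Innings divided by 9 stands in for "games played".
    return total_runs / (total_innings / 9)


# Hypothetical season totals for a single park (not real data).
example_bpera = ballpark_era(350, 340, 729.0, 700.0)
print(round(example_bpera, 2))
```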
If I could then create the most succinct way of calculating park factors, I could maybe predict how any pitcher would pitch in every park.
Let's start with the Park Factors equation as it appears on everybody's favourite analytics blog: Wikipedia.
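For reference, the runs-based park factor is commonly written along the following lines. Treat the notation as an assumed reconstruction: RS is runs scored, RA is runs allowed, G is games, and the subscripts mark home and road totals.

```latex
\mathrm{PF} = 100 \times
  \frac{(\mathrm{RS}_{\mathrm{home}} + \mathrm{RA}_{\mathrm{home}}) / \mathrm{G}_{\mathrm{home}}}
       {(\mathrm{RS}_{\mathrm{road}} + \mathrm{RA}_{\mathrm{road}}) / \mathrm{G}_{\mathrm{road}}}
```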
Not a bad place to start. Using a large enough sample size, that might be good enough to show the ratio by which runs are scored in one ballpark vs. other ballparks. I arbitrarily set this sample size at five years before realizing the first mistake.
Games? Why am I using games? That doesn't seem like a very precise science. That seems like an approximate science. Why approximate when we can be precise? In the event Josh Johnson puts a Blue Jays uniform on again, I don’t want to know approximately how bad he’ll be, I want to know precisely how bad he will be. Simple adjustment to the equation: take home innings pitched and road innings pitched and divide each by nine. Then I realized my second mistake.
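A minimal sketch of that adjustment, assuming the usual home-rate over road-rate shape of a park factor, with innings pitched divided by nine standing in for games. The function and argument names are hypothetical.

```python
def park_factor(home_runs_scored, home_runs_allowed, home_innings_pitched,
                road_runs_scored, road_runs_allowed, road_innings_pitched):
    """Park factor with innings pitched / 9 replacing games played."""
    # Runs per nine innings in home games and in road games.
    home_rate = (home_runs_scored + home_runs_allowed) / (home_innings_pitched / 9)
    road_rate = (road_runs_scored + road_runs_allowed) / (road_innings_pitched / 9)
    # Multiplying by 100 only makes the ratio read like a percentage.
    return 100 * home_rate / road_rate
```

A value above 100 would mean more runs per nine innings at home than on the road, and a value below 100 the opposite.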
To calculate a ballpark’s ERA, the numerator was easy enough: it’s runs scored and runs allowed while at home. But the denominator is innings pitched by both teams. Sounds easy again, but trying to find a statistic on how many innings the away team pitched for each ballpark is a challenge. I had some options. The first was to look at every game played in every park and count them out by hand, but I was doing five years’ worth of park factors. So I took the less accurate route and subtracted home wins from away innings pitched. If you have the lead as a home team in the ninth inning, the away team doesn’t pitch to you in the bottom of the ninth. Trivial to some, but, according to my chart, this adds up to a difference of over four games pitched between home and road games. That is to say, every team pitches, on average, 40.6 fewer innings on the road than it does at home. Over a 2916-inning season, 40.6 may not seem like that big of a deal, but to omit it would be careless.
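One way to read that shortcut is sketched below. It assumes the visiting staff's innings in a park can be approximated by the home team's innings pitched at home minus one inning per home win, which ignores walk-offs and extra innings. The names and sample numbers are hypothetical.

```python
def estimate_visitor_innings(home_team_innings_at_home, home_wins):
    """Approximate innings pitched by visiting teams in one park.

    Assumption: each home win costs the visitors the bottom of the ninth,
    so they pitch one inning fewer than the home team in that game.
    """
    return home_team_innings_at_home - home_wins


# Scale check using figures from the text: 162 games x 9 innings x 2 teams.
SEASON_INNINGS = 162 * 9 * 2        # 2916
GAMES_OF_DIFFERENCE = 40.6 / 9      # a little over four games

# Made-up example: 729 home innings pitched and 45 home wins.
visitor_innings = estimate_visitor_innings(729, 45)
```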
Now the dream grew. I had plugged in the numbers and had a pretty good park factor table with nothing too far out of consensus. The Colorado Rockies had the best hitter's park in baseball in every year except 2011. Similarly, the Padres and Giants showed up as the best pitcher's parks along with, somewhat surprisingly, the Mariners. Bringing in the fences at Safeco in 2013 seems to have had a quantifiable effect: it went from the best pitcher's park in 2012 to the 13th best. I was going back to 2010, so I made sure to note that the Marlins got a new ballpark in 2012. Also, the Astros switched from the National League to the American League in 2013 and I wanted to see what that had done to their park factors. But that wasn’t enough. I needed to create the formula that would, using only these numbers, predict the ballpark’s ERA.
The formula I settled on, after significant tinkering, was this:
pBPERA = league average BPERA x (BPF/100)
Dividing by 100 gets you back to the ratio by which runs are scored at home versus other ballparks. Keep in mind, the original park factor formula multiplies by 100 just to make the number look like a percentage.
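Here is the formula as a small function, just to make the arithmetic concrete. The function name is made up, and the sample inputs pair two figures quoted later in this piece (a 4.28 mean BPERA and a 101.5 park factor) purely for illustration.

```python
def projected_bpera(league_average_bpera, park_factor):
    """pBPERA = league average BPERA x (BPF / 100)."""
    return league_average_bpera * (park_factor / 100)


# Illustration only: 4.28 x 1.015
print(round(projected_bpera(4.28, 101.5), 4))
```

A park factor of exactly 100 leaves the league average unchanged, which is a quick sanity check on the formula.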
Another question: how do we get the league average BPERA before the season starts? If this truly is a predictive formula, we will need that value prior to the season’s beginning. At first I just wanted to see if this returned correlations, so I calculated the true league average ERA by averaging out the BPERA values I had already calculated. We could always go back and set these values manually. The average BPERA varied between 4.09 and 4.48, trending downward, in those five years. It seems a bit counter-intuitive to project a value just to project a different value, but this happens routinely in other sabermetric statistics.
The last question I had to answer for myself: What to do with the Astros and the Marlins? There are only two years of data for the Astros, so I would like to see them omitted, but we’ll continue to use them. No safe predictions can be made as of yet, but we’ll see. The Marlins, on the other hand, have three years of usable data and I think that’s a strong enough sample size.
We’re a quick standard deviation calculation away from finishing this now. Take the totals of everything, make sure to weight them correctly, and you get a mean BPERA of 4.28. The variance of the BPERA is 0.21, for a standard deviation of 0.462. Now count how many of my projected values fall outside of this deviation and you get a grand total of two. There are two ballparks that my formula doesn’t like: Progressive Field, which lies 0.5011 outside of the mean, and Target Field, which only misses the standard deviation by six ten-thousandths. Over five years, my predictive value looked especially strong at Turner Field (0.0845), US Cellular Field (0.0909), Great American Ballpark (0.0121), Coors Field (0.0479), Dodger Stadium (0.0184), Nationals Park (0.0522), and Minute Maid Park (0.0860)* with* an* asterisk* or* 6*. The asterisk denotes a small sample size because of the switch from the National League to the American League. Perhaps this is needless worry, but this practice is still important.
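The outlier check can be sketched in a few lines. This assumes each figure quoted above is a park's distance from the mean, and that Target Field's distance is the standard deviation plus the 0.0006 it misses by. The function and variable names are hypothetical.

```python
def parks_outside_one_sd(distances_from_mean, standard_deviation):
    """Return parks whose distance from the mean exceeds one standard deviation."""
    return sorted(
        park for park, distance in distances_from_mean.items()
        if abs(distance) > standard_deviation
    )


STANDARD_DEVIATION = 0.462

# Distances quoted in the text; Target Field is derived as 0.462 + 0.0006.
distances = {
    "Progressive Field": 0.5011,
    "Target Field": 0.4626,
    "Turner Field": 0.0845,
    "US Cellular Field": 0.0909,
    "Great American Ballpark": 0.0121,
    "Coors Field": 0.0479,
    "Dodger Stadium": 0.0184,
    "Nationals Park": 0.0522,
    "Minute Maid Park": 0.0860,
}

outliers = parks_outside_one_sd(distances, STANDARD_DEVIATION)
print(outliers)
```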
Park factors, in a vacuum, aren’t really altogether useful. Maybe that’s what Cal Ripken meant. Sure, we know they score lots of runs at Coors Field and not very many runs at Petco Park, but why care? What matters is how we read these statistics and apply this knowledge, just like any statistic. In this way, we can use park factors to better understand what a truly ‘bad’ or ‘good’ season looks like for each pitcher. Also, we can use them to assign values that actually mean something, like the expected runs of any given ballpark. I’m confident this could even be used to evaluate batters, but that’s for another day.
There’s a larger lesson in all of this though. No statistic or formula can be perfect. We’re a long way from predicting the next Josh Johnson. But with BPERA, we’re at least a step closer to understanding, quantifying and actually applying what park factors really mean. Of course, Josh Johnson only pitched for one season in the new Marlins Park before being moved to Toronto. And the most glaring statistical change, according to Fangraphs, is his HR/FB rate increasing by 10% from Marlins Park to the Rogers Centre. Naturally, Marlins Park is one of the parks with a small sample but, through the three years of its short existence, its park factor has been relatively static at 95.1, 98.6 and 95.4, while Rogers Centre has a five-year park factor of 101.5.
This was a surprisingly large amount of data to collect for just the beginning. And there’s no doubting that my park factor formula will change in years to come to correct for the outliers. There’s nowhere near enough data to predict any single pitcher’s abilities in any single park as of yet, and there may never be. But I’d be confident enough to calculate, for instance, how many runs get scored in Cincinnati next year. Taking the subsequent logical steps could hopefully be something I not only post on Beyond the Box Score, but also complete with the assistance of the analytics community this blog has helped foster.