The last time we met, I used the ESPN Home Run Tracker data to find where more home runs are being hit and to what field. The next question is not only how many home runs are hit at each park, but what is the shape of the home run flight and how does that change based on park? What is the average height of each home run hit to each field at each park and how is that affected by the park dimensions?
For this analysis, I will be using the "Apex" value in the Home Run Tracker data. This measures "the highest point reached by the ball in flight above field level, in feet." Lower apex values represent line drive home runs while higher apexes can be called towering home runs.
I'll borrow the table used in the previous post to remind everyone of the angle buckets:
|Horizontal Angle||Field Position|
I split up Safeco Field, PETCO Park, and Citi Field when the dimensions changed. They are different stadiums and should be treated as such.
First some more rubber band charts, showing the average apex for all home runs hit in each park since 2006.
The orange line representing new Safeco Field sticks out like a sore thumb because it only covers 18 total home runs. I included it on the chart to show its comparison to the previous Safeco dimensions. Nearly every park follows the same trend, forming an "m" shape.
Comerica shows an extremely low average apex in left field, while Kauffman, Fenway, and Target all have high average apex values there. Comerica's left field is not as deep as Kauffman's or Target's, and Fenway has the Green Monster.
In center field, Comerica, Fenway, and Target all have higher than average apex values. Comerica and Target both have distances greater than 400 feet, but Fenway's unique shape presents a bit of a problem. Straight center field is only 390 feet, the shortest in major league baseball. However, the wall is 18 feet tall and juts back to 420 feet just right of center.
The Metrodome had the lowest average apex to center, while Progressive Field is the current park with the most line drives there.
In right field, the Metrodome jumps up to have the highest average apex and its successor ensures that the throne stays in the Twins organization. The old dimensions in Safeco Field had the lowest average apex to right field, just squeezing by Tropicana. In fact, Safeco proves to be conducive to lower apex home runs in all fields.
Now back to a graph.
Now back to ME. I mean the graph. Sadly, it isn't a home run video. This time around, it's the light green line messing everything up. That's Petco Park after the change in wall dimensions, so it only represents 16 home runs. I left it in to show that the changes in left center and right center field are already being seen.
The left field high extremes are Marlins Park, Minute Maid Park, and Citi Field. The lowest extreme is Wrigley Field. Those trends hold up fairly well in center field, though Chase Field usurps Citi Field with a higher average apex.
In right field, Marlins Park is still at the top, but PNC Park comes in second and Minute Maid Park ends up having the lowest average apex of current stadiums (Sun Life and Shea Stadiums had lower values).
So this is where my analysis takes a specific turn. Marlins Park has consistently high average apex at each angle except LCF and I wanted to find out why. Are the dimensions just longer? Are the walls higher? What is going on in Miami?
This led me to create three types of home runs based on apex: Line Drive (apex below 67 feet), Towering (apex above 107 feet), and Regular (everything else). I defined line drive and towering home runs as those more than one standard deviation below or above the mean apex value, respectively. Look down, now back up. What do you see? The major league percentages of each type of home run by year since 2006:
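The cutoffs of 67 and 107 feet are consistent with a mean apex near 87 feet and a standard deviation near 20 feet. Here is a minimal sketch of that bucketing. The apex values are made up for illustration, and the use of the population standard deviation is my assumption, since the post does not say which one was used.

```python
from statistics import mean, pstdev

def classify_home_runs(apexes):
    """Label each apex (in feet) relative to one standard deviation from the mean."""
    mu = mean(apexes)
    sigma = pstdev(apexes)  # assumption: population standard deviation
    low, high = mu - sigma, mu + sigma
    labels = []
    for apex in apexes:
        if apex < low:
            labels.append("Line Drive")
        elif apex > high:
            labels.append("Towering")
        else:
            labels.append("Regular")
    return labels, low, high

# Hypothetical apex values in feet, not Home Run Tracker data
sample = [45, 60, 70, 80, 87, 90, 95, 100, 115, 128]
labels, low, high = classify_home_runs(sample)
for apex, label in zip(sample, labels):
    print(apex, label)
```

With real data, the percentage of each type per park or per year is just the count of each label divided by the total.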
A smaller percentage of home runs were line drives last year, but the change isn't large enough to be driving the Marlins Park outlier. Line drive home runs were at their lowest in 2009 and their highest in 2007.
What's in your hand? Now back at the post, I have it. It's a table with the percentage of each type of home run at each park, sorted by increasing line drive percentage. Look again, the table is now diamonds:
|Park||Line Drive||Regular||Towering|
|Citi Field Old||12.1%||70.9%||17.0%|
|PETCO Park (New Dimensions)||12.5%||87.5%||0.0%|
|Citi Field (New Dimensions)||15.9%||67.1%||17.1%|
|Safeco Field (New Dimensions)||16.7%||61.1%||22.2%|
|Great American Ballpark||18.1%||64.9%||17.1%|
|Safeco Field Old||18.4%||67.3%||14.3%|
|PETCO Park Old||18.6%||67.5%||14.0%|
|Sun Life Stadium||18.6%||65.7%||15.7%|
|Citizens Bank Park||19.7%||63.2%||17.2%|
|Minute Maid Park||19.7%||59.5%||20.8%|
|Old Yankee Stadium||19.7%||63.3%||16.9%|
Everything's coming up Marlins. Only 10% of the home runs hit there are classified as line drives. Yankee Stadium and Minute Maid Park are the only two parks where fewer than 60% of home runs are regular, as both are well above average in the line drive and towering categories.
Why are there fewer line drive home runs hit in Marlins Park, particularly to center field? The wall is 415 feet deep and 20 feet high, which is obviously part of the problem, but which dimension matters more? Are there more factors not described by wall distance and height? You can run a simple linear regression and look at each, but why not do both at the same time? Enter multiple regression.
The first step to doing any sort of data analysis is to actually get the data. I already had the apex numbers from ESPN Home Run Tracker, but was missing park dimension data. I thought I was in for a brute-force copy-and-paste treat until Colin Wyers and Bill Petti guided me to the Seamheads downloadable park database, which allowed me to much more easily place stadium dimensions next to average apex values.
Next, I decided to use the average of the bottom 50% of apex values at each angle in each park instead of the total average apex. This will show the most line-drive of home runs while excluding any towering balls that would exit any park, regardless of wall height (I ran the same regression using simple average values and the results are very similar).
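As a rough sketch of that step, the snippet below groups home runs by park and field, keeps the lower half of apex values in each group, and averages them. The park names and numbers are hypothetical, and rounding the cutoff down for odd-sized groups is my assumption.

```python
from collections import defaultdict
from statistics import mean

def bottom_half_mean(values):
    """Average the lowest 50% of values (rounding the count down, minimum one)."""
    ordered = sorted(values)
    keep = max(1, len(ordered) // 2)
    return mean(ordered[:keep])

def low_apex_by_park_and_field(records):
    """records: iterable of (park, field, apex_in_feet) tuples."""
    groups = defaultdict(list)
    for park, field, apex in records:
        groups[(park, field)].append(apex)
    return {key: bottom_half_mean(vals) for key, vals in groups.items()}

# Hypothetical records, not Home Run Tracker data
records = [
    ("Park A", "CF", 60), ("Park A", "CF", 70),
    ("Park A", "CF", 90), ("Park A", "CF", 110),
    ("Park B", "CF", 80), ("Park B", "CF", 85), ("Park B", "CF", 120),
]
result = low_apex_by_park_and_field(records)
print(result)
```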
I ran this regression using left center, center, and right center field data, since they had the most well-defined dimensions for each of the parks in the Seamheads database. They also happen to match up with the values where Marlins Park really depresses line drive home runs.
Here is a table summarizing the important statistical findings of each regression*:
*We're about to get into more involved statistical techniques. I am by no means a statistician and have had only basic formal training in the field. I may miss some of the subtleties of the results, but hope that the underlying principles are still correct.
|Field||Variable||R Squared||Significance of F||T-Statistic||P-Value|
Left center field is just a mess and I don't know why. The R-squared value is low and the significance of F and p-values are high, meaning that the wall height and distance are not strongly related to the average apex of low home runs hit to left center field. Perhaps graphing these values will help better explain:
There is almost no relationship, which is quite different from right center:
So which matters more, length to the wall or height? In each field, the t-statistic for wall height is larger, meaning it is a stronger variable and matters more than distance. Makes sense. Major league hitters can hit the ball extremely hard and even if the ball's highest point is only 60 feet above the field, it can fly for well over the distance it needs to go to clear the wall.
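For anyone curious where those t-statistics come from, here is a minimal sketch of a two-variable least squares fit in plain Python. It is not the tool I used, and the wall dimensions and apex values are invented so that the true coefficients are known (intercept 20, length 0.1, height 0.5). Each t-statistic is the coefficient divided by its standard error.

```python
from math import sqrt

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(rows, y):
    """Fit y = b0 + b1*x1 + b2*x2 + ...; return coefficients and t-statistics."""
    X = [[1.0] + list(r) for r in rows]
    n, p = len(X), len(X[0])
    xtx = [[sum(X[k][i] * X[k][j] for k in range(n)) for j in range(p)]
           for i in range(p)]
    xty = [sum(X[k][i] * y[k] for k in range(n)) for i in range(p)]
    beta = solve(xtx, xty)
    resid = [y[k] - sum(b * x for b, x in zip(beta, X[k])) for k in range(n)]
    s2 = sum(e * e for e in resid) / (n - p)  # residual variance
    t_stats = []
    for i in range(p):
        unit = [1.0 if j == i else 0.0 for j in range(p)]
        inv_col = solve(xtx, unit)  # column i of the inverse of X'X
        t_stats.append(beta[i] / sqrt(s2 * inv_col[i]))
    return beta, t_stats

# Hypothetical (wall length ft, wall height ft) pairs, each used twice
walls = [(390, 8), (390, 16), (410, 8), (410, 16)] * 2
noise = [1, 1, 1, 1, -1, -1, -1, -1]  # invented scatter around the true line
apex = [20 + 0.1 * l + 0.5 * h + e for (l, h), e in zip(walls, noise)]

beta, t_stats = ols(walls, apex)
print("coefficients:", [round(b, 3) for b in beta])
print("t-statistics:", [round(t, 3) for t in t_stats])
```

In this made-up data the height t-statistic comes out about twice the length one, which is the same kind of comparison I am making above. Turning a t-statistic into a p-value needs the t-distribution, which this sketch leaves out.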
Now that we have equations giving us expected results, how do they compare to what we actually see?
LCF Apex = 56.8 + 0.044*LCF_Length + 0.097*LCF_Height
CF Apex = 9.16 + 0.146*CF_Length + 0.735*CF_Height
RCF Apex = 38.6 + 0.085*RCF_Length + 0.395*RCF_Height
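To make the equations concrete, here is a small sketch that plugs dimensions into them. The coefficients are copied from the three equations above, and the example uses the Marlins Park center field figures quoted earlier (415 feet deep, 20 feet high). The function and variable names are just mine for illustration.

```python
# (intercept, length coefficient, height coefficient) from the equations above
COEFFS = {
    "LCF": (56.8, 0.044, 0.097),
    "CF": (9.16, 0.146, 0.735),
    "RCF": (38.6, 0.085, 0.395),
}

def predicted_apex(field, length_ft, height_ft):
    """Predicted average low-apex value (feet) for a wall at the given field."""
    intercept, b_length, b_height = COEFFS[field]
    return intercept + b_length * length_ft + b_height * height_ft

# Marlins Park center field: 415 feet deep, 20 feet high
marlins_cf = predicted_apex("CF", 415, 20)
print(round(marlins_cf, 2))
```

That works out to 9.16 + 60.59 + 14.70, or roughly 84 feet.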
These equations fall apart at the extremes. For each of LCF, CF, and RCF, the parks that most exceed their predicted values are those with the highest average apex values. This means Fenway in left, Marlins in center, and Target in right.
However, a few parks miss their predictions without having extreme apexes. Sun Life Stadium is the best example. It ranks 17th in apex for LCF, but 32nd in residual. The LCF equation calculated an expected value of 78, but the actual value was 75.
The biggest outlier in center is Chase Field. It has the 2nd highest average apex for home runs, but ranks last in residuals. The equation actually predicts a higher apex (87) than reality (82). Perhaps the dry ball has something to do with that?
Finally, Citi Field is our outlier in right. It has a middle of the pack apex value (76), but has a much higher predicted value (79).
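For reference, the residuals for the three parks called out above can be tallied like this. I am assuming the usual convention of actual minus predicted, so a negative number means the park's real apex came in below what the equation expected.

```python
# (park, field, predicted apex, actual apex) in feet, as quoted above
reported = [
    ("Sun Life Stadium", "LCF", 78, 75),
    ("Chase Field", "CF", 87, 82),
    ("Citi Field", "RCF", 79, 76),
]

# Assumed convention: residual = actual - predicted
residuals = {park: actual - predicted for park, _, predicted, actual in reported}

for park, value in residuals.items():
    print(f"{park}: {value:+d} ft")
```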
So how can the number of home runs be increased in Marlins Park? I think the first step is to simply lower the fences. According to Seamheads, the walls are 12-20 feet high. If they lowered them to about 8 feet, that would help the issue. I don't know how many line drives hit high up on the wall, but those would all become home runs. It wouldn't hurt to move the walls in a bit either. Marlins Park is in the top ten of predicted apex height based on distance and height to all three fields.
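As a back-of-the-envelope check, the center field equation gives a sense of how much an 8-foot wall might change things. This is only a sketch: it extrapolates a linear fit, treats the 415-foot distance as unchanged, and says nothing about how many extra home runs would actually result.

```python
# Center field equation from above: CF Apex = 9.16 + 0.146*length + 0.735*height
CF_INTERCEPT, CF_LENGTH, CF_HEIGHT = 9.16, 0.146, 0.735

def cf_apex(length_ft, height_ft):
    """Predicted average low-apex value (feet) for center field."""
    return CF_INTERCEPT + CF_LENGTH * length_ft + CF_HEIGHT * height_ft

current = cf_apex(415, 20)  # wall as described earlier
lowered = cf_apex(415, 8)   # hypothetical 8-foot wall
drop = current - lowered

print(f"current: {current:.2f} ft, lowered: {lowered:.2f} ft, drop: {drop:.2f} ft")
```

Taking 12 feet off the wall lowers the predicted apex by 12 times 0.735, or just under 9 feet, which would move center field from about 84 feet down to about 76.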
However, in the end we are once more faced with a question: why is the left center field data not significant? More home runs are hit there total, but is the difference large enough to be driving this? Is it related to left handed hitters and their large amount of opposite field home runs? Also, why do the prediction equations fall apart at the extremes and what other variables affect the line-drive-ness of home runs?
I'm on a horse.