In a recent Baseball Prospectus ProGUESTus article, I dove in to examine the called strike zone during the PITCHf/x era, among other things to investigate the effect on the size and shape of the zone that various game states impose. During that study, I found that several aspects of the game have an impact on the way the strike zone is called. With this framework in place, I wanted to attempt to derive a formula to describe the strike zone as accurately as possible.
A drawback of the high-resolution 1x1 inch grid technique that I used to measure the size and shape of the strike zone is that I cannot simply create such a grid for all 82 mph curveballs that were taken by the hitter in a 2-2 count with a runner on second and one out in 83 degree weather, as the sample sizes would be practically nonexistent.
What I do have, though, are individual grids for each of these component pieces: one for pitches under 84 mph, one for counts with two strikes, one for situations with one out, and so on. In other words, given the final location of each pitch, I know what percentage of pitches thrown with that velocity to that location were called strikes. Likewise for the other factors.
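As a rough illustration of what one such component grid looks like, the sketch below bins taken pitches into 1x1 inch cells and records the share of pitches in each cell that were called strikes. The `(px, pz, called_strike)` tuple layout, the function names and the sample pitches are all assumptions made for this example, not the code or data behind the study.

```python
import math
from collections import defaultdict

def cell_for(px, pz):
    """Map a location in feet to a 1x1 inch grid cell (column, row)."""
    return (math.floor(px * 12), math.floor(pz * 12))

def build_grid(pitches):
    """pitches: iterable of (px, pz, called_strike) for taken pitches.
    Returns {cell: fraction of pitches in that cell called strikes}."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [strikes, total]
    for px, pz, called_strike in pitches:
        cell = cell_for(px, pz)
        counts[cell][0] += 1 if called_strike else 0
        counts[cell][1] += 1
    return {cell: s / n for cell, (s, n) in counts.items()}

# Hypothetical taken pitches under 84 mph (made-up values).
slow_pitches = [
    (0.02, 2.51, True),
    (0.05, 2.55, True),
    (0.03, 2.52, False),
    (1.40, 2.50, False),
]
slow_grid = build_grid(slow_pitches)
```

Looking up `slow_grid[cell_for(px, pz)]` for a new pitch then gives the called strike rate for that location under that one factor. A separate grid would be built the same way for each of the other factors.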
To attempt to classify pitches as strikes or balls then, my first inclination was to try a multiple logistic regression using these component pieces. My predictor variables were the percentages of pitches called strikes for the particular grid location of each pitch for each of the game factors. The categorical output was either strike or ball. In executing this test, I found a lot of statistical significance across the categories, but there were four that stood out as the most influential.
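To make the setup concrete, here is a minimal sketch of a logistic regression in which each predictor is a grid-derived strike rate and the outcome is strike or ball. It uses plain batch gradient descent and six made-up rows; the function names, the learning rate and the data are assumptions for illustration only, and a statistics package would normally be used instead.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def fit_logistic(rows, labels, lr=0.5, steps=5000):
    """Batch gradient descent on the logistic log-loss.
    Returns (intercept, weights)."""
    n, k = len(rows), len(rows[0])
    b, w = 0.0, [0.0] * k
    for _ in range(steps):
        grad_b, grad_w = 0.0, [0.0] * k
        for x, y in zip(rows, labels):
            err = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) - y
            grad_b += err
            for j in range(k):
                grad_w[j] += err * x[j]
        b -= lr * grad_b / n
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
    return b, w

def strike_probability(model, x):
    """Predicted probability that a pitch with predictors x is called a strike."""
    b, w = model
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))

# Made-up rows: [strike rate from the velocity grid, strike rate from the count grid]
rows = [[0.90, 0.80], [0.85, 0.95], [0.70, 0.60],
        [0.10, 0.20], [0.20, 0.05], [0.40, 0.30]]
labels = [1, 1, 1, 0, 0, 0]  # 1 = called strike, 0 = ball
model = fit_logistic(rows, labels)
```

A pitch is then classified as a strike when `strike_probability(model, x)` is above 0.5.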
The first was batter handedness, which had to be among them in any case, because I did all of my strike zone analysis on LHH and RHH separately given how distinct their called strike zones are. The other three most significant predictors were batter height, number of strikes in the count and pitch velocity. I did not include pitch type as a predictor, simply because there are too many types, including rare types like pitchouts and knuckle-curveballs, and I wanted to be able to classify every pitch using this method.
As I began to calculate the success rate of the output formula for all called pitches, I noticed fairly quickly that the inclusion of some or all of these additional factors into the equation does not really raise the level of accurately predicted pitches very much. Here is a sample table showing the success rate of all called pitches (in this case excluding pitchouts, hit by pitches, balls in the dirt and intentional balls) for RHH in 2012:
|Factors||Correct||Incorrect||Success Rate|
|Hand / Height / Strikes / Velocity||185,071||17,869||91.19%|
|Hand / Height / Strikes / Velocity / Outs / Bases / League / Temperature||185,109||17,831||91.21%|
Results of called pitch prediction, RHH, 2012
We can see that adding the three other most significant predictors to batter handedness improves our success rate by correctly classifying 777 more pitches. For reference, for 2012 the LHH equation added 677 pitches to the total. So for an entire season's worth of pitches, taking into account the batter height, strike count and pitch velocity in this manner netted us 1,454 more pitches in the good column out of the 367,332 pitches in our sample, just 0.4%. Alternatively, looking at the table above, we successfully converted 777 out of 18,646 that we were previously binning incorrectly for RHH, which is over 4% of that total.
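The arithmetic in the paragraph above can be checked directly from the quoted counts. The sketch below uses only numbers stated in this article; the variable names are mine, and the hand-only success rate at the end is implied by those counts rather than quoted anywhere in the text.

```python
# Counts reported for RHH in 2012 (hand / height / strikes / velocity row).
correct_four = 185_071
incorrect_four = 17_869
total_rhh = correct_four + incorrect_four          # 202,940 called pitches

gained_rhh = 777                                   # extra correct calls vs. hand only
gained_lhh = 677
total_sample = 367_332                             # RHH + LHH called pitches

# Implied hand-only counts for RHH.
correct_hand = correct_four - gained_rhh           # 184,294
incorrect_hand = incorrect_four + gained_rhh       # 18,646, matching the text

share_of_all = (gained_rhh + gained_lhh) / total_sample   # about 0.4%
share_of_misses = gained_rhh / incorrect_hand             # a little over 4%
hand_only_rate = correct_hand / total_rhh                 # implied, not quoted in the text
```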
In one sense, this is useful knowledge, as these improvements are going to be on borderline pitches that we are now identifying with the context of the game factors in hand, and making a better judgement. On the other hand, adding all of this complexity to improve our success rate by 0.4% may not be worth the effort for most applications of this data.
Seeing how successful the classification was purely by using the batter handedness, I turned my focus to identifying a formula for the strike zone based solely on this factor that could be employed without the privilege of the grid data that I generated. I felt like this would be more useful to the general public than me giving you numbers without you being able to reproduce them. Of course there is already one perfectly good simple formula for the strike zone, developed by Mike Fast, which relies on pitch location and batter height information. I wondered, though, whether fitting the zone with a non-rectangular shape would let me remove the height input without sacrificing much in the way of accuracy. After all, quick: how tall is Lorenzo Cain? I don't know. I mean, I can look it up, as can you, but I wondered if we could make the formula even "simpler" in terms of input requirements.
After some experimentation, I settled on an elliptical model to describe the strike zone. This is not groundbreaking in and of itself, as Matthew Carruth explored this previously at FanGraphs when he examined the strike zone by count. I arrived at the following formula for 2012, which has a version for each batter hand:
RHH: ((px + 0.04167)^2 / 1.17361) + ((pz - 2.52083)^2 / 0.95877) <= 1
LHH: ((px + 0.22918)^2 / 1.12891) + ((pz - 2.50000)^2 / 0.91840) <= 1
For the 2012 season, these formulas correctly identify as a strike or a ball 90.53% of all called pitches in the context of the above table, or 90.87% of all pitches where the batter did not attempt a swing (including now pitchouts, etc.). So this is relatively accurate, given the irregular shape of the strike zones and the fact that we are employing only the pitch location as an input.
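For anyone who wants to apply the two formulas programmatically, they translate into a few lines of code. The sketch below encodes the 2012 coefficients exactly as written above; the function name `in_zone`, the dictionary name and the `"R"`/`"L"` keys are naming assumptions of mine, and `px` and `pz` are in feet, as PITCHf/x reports them.

```python
# Ellipse parameters from the 2012 formulas:
# (center_x, center_z, x_denominator, z_denominator), locations in feet.
ZONE_2012 = {
    "R": (-0.04167, 2.52083, 1.17361, 0.95877),
    "L": (-0.22918, 2.50000, 1.12891, 0.91840),
}

def in_zone(px, pz, stand):
    """True if (px, pz) falls inside the elliptical zone for batter hand stand."""
    cx, cz, a2, b2 = ZONE_2012[stand]
    return (px - cx) ** 2 / a2 + (pz - cz) ** 2 / b2 <= 1.0
```

For example, a pitch at `px = -1.2`, `pz = 2.5` lands inside the LHH ellipse but outside the RHH one, because the LHH ellipse is centered further toward negative `px`.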
With these formulas in place, I calculated my own plate discipline statistics for batters in the 2012 season. This would serve as a sanity check for the equations, but a secondary point of curiosity for me was to compare the various swing-related metrics reported by both BIS and PITCHf/x and see whether each of them accounts for the "lefty strike" or not.
The most critical plate discipline statistic to get right here is Zone%. After all, it is pretty easy to count swings vs. no swings and contact vs. no contact. The denominator for all of the Z-based and O-based swing metrics relies on a proper Zone% as a foundation. The following table shows the comparison of Zone% as calculated by the above formulas with the BIS Zone% and PITCHf/x Zone%, from FanGraphs.
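To show how the zone decision feeds the other metrics, here is a small sketch that computes Zone%, Z-Swing%, O-Swing% and Swing% from a list of pitches already flagged as inside or outside the zone. The `(in_zone, swung)` pair layout and the ten sample pitches are made up for illustration, and the function assumes the hitter saw at least one pitch on each side of the zone boundary.

```python
def plate_discipline(pitches):
    """pitches: list of (in_zone, swung) boolean pairs for one hitter."""
    total = len(pitches)
    zone = [swung for inside, swung in pitches if inside]      # swing flags, in-zone pitches
    out = [swung for inside, swung in pitches if not inside]   # swing flags, out-of-zone pitches
    return {
        "Zone%": len(zone) / total,
        "Z-Swing%": sum(zone) / len(zone),
        "O-Swing%": sum(out) / len(out),
        "Swing%": sum(swung for _, swung in pitches) / total,
    }

# Hypothetical hitter: 4 pitches in the zone (3 swings), 6 outside (1 swing).
sample = [(True, True), (True, True), (True, False), (True, True),
          (False, False), (False, True), (False, False), (False, False),
          (False, False), (False, False)]
stats = plate_discipline(sample)
```

Because the in-zone and out-of-zone pitch counts are the denominators of Z-Swing% and O-Swing%, moving the zone boundary changes both metrics even though the swings themselves do not change.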
|Hitters||Average Absolute Delta||Average Delta|
Average Deltas Between Calculated Zone% and Published Zone% Metrics, 2012 (batters with > 500 pitches faced)
Clearly the results of my formula align much better with the PITCHf/x strike zone than the BIS strike zone. Of course, this would probably be expected, given that I arrived at my simplified formula using PITCHf/x data as the raw inputs. Note that the formulas above give a distribution that is almost exactly centered on the PITCHf/x mean for the strike zone.
A second observation here is that the BIS strike zone for left-handed hitters departs from my calculated zone much more than its zone for right-handed hitters does. This can also be seen by looking at the hitters whose BIS-reported Zone% is furthest from the zone calculated here: 86 of the 100 hitters with the largest discrepancy bat left-handed. So BIS is certainly measuring the strike zone differently for left-handed hitters than PITCHf/x.
There are some potential theories here that could explain this difference. The most likely is that BIS is intending to report the strike zone based on its rulebook definition. This is certainly a defensible position. Regardless of whether that is true or not, they may also be reporting the same zone (however they have defined it) as they have since these statistics were first introduced, to maintain backward compatibility.
While I do not know how BIS defines their zone, given these observed differences, one simple test we can perform is to look at the largest left-handed hitting outlier along with the hitter for which the zones agree the most, and look at their pitch map from 2012.
The largest gap comes from Skip Schumaker. I calculated his Zone% as 50.8%, PITCHf/x has 49.9% but BIS has just 39.3%. This is a difference on the order of 136 pitches out of the 1187 pitches he faced last season, over 11%. Here is the map of his called strike zone.
Of course these images show only pitches taken by these two left-handed batters, so all of the pitches that were swung at are not displayed. Even from this set, we can see that Saunders was pitched in a much more balanced manner across the inside and outside edges of the plate than Schumaker. Looking at the swinging pitches for these hitters doesn't sway that observation.
This (tiny!) sample makes it look plausible that BIS is reporting something closer to the rulebook strike zone. Suppose BIS does not include some or all of the "lefty strike" region that umpires call, but does include pitches on the inside corner to lefties that are strikes by the rulebook yet typically are not called strikes. In Schumaker's case, he would lose out on many outside pitches, yet have virtually no pitches on the inside corner to offset this, leading to a smaller reported zone. In Saunders' case, the pitches off the outside edge that are ignored are made up for by a roughly similar number of pitches seen on the inside corner.
So the question is, do these seemingly small differences in the reported numbers make a real difference? To see how the decision on Zone% filters down to the plate discipline stats that we know and love, we can now turn our attention to Z-Swing%.
|Hitters||Average Absolute Delta||Average Delta|
Average Deltas Between Calculated Z-Swing% and Published Z-Swing% Metrics, 2012 (batters with > 500 pitches faced)
Interestingly, despite the significant gap above, the Z-Swing% differences end up looking less severe than they did on the whole for BIS, while magnified for PITCHf/x. When analyzing plate discipline stats though, we would be typically looking at individual hitters, not the entire population. So let's see how this looks at the player level, by looking at the hitters with the largest gap between my calculated Z-Swing% and reported Z-Swing%.
|Hitter||Bats||Calculated Z-Swing%||BIS Z-Swing%||PITCHf/x Z-Swing%||BIS Delta|
Hitters with Largest Z-Swing% Deltas, Calculated vs BIS, 2012 (batters with > 500 pitches faced)
Notable here of course is that all of the largest differences come from hitters who hit left-handed in either all or most of their plate appearances. In fact, you have to scroll down to number 40 on the leaderboard until the first right-handed hitter is reached in DJ LeMahieu.
Considering the magnitude of the differences for a moment, for context, the standard deviation of the BIS Zone% for this sample of hitters is 5.7%. So hitters listed in the above table are one standard deviation away from the zone that I've calculated, a difference that is sometimes even slightly larger when compared to the PITCHf/x reported zone.
Here is a look at the pitches taken by Jay Bruce in 2012. Again we see a dense grouping down and away whose classification could be driving the observed differences.
After making these observations, I was able to ask Baseball Info Solutions Vice President of Product Development and Sales Ben Jedlovec about the way that they define the strike zone. He confirmed my suspicions, namely that BIS defines the strike zone per the rulebook, adjusted for the batter's height accordingly. This explains at least the bulk of the difference that we are seeing compared to my calculations, which are based on PITCHf/x raw data and show that a consistent "lefty strike" is called by umpires in practice, despite these pitches not falling within the rulebook strike zone.
For comparison, let's look at the largest differences of hitters between my calculated Z-Swing% and the PITCHf/x reported Z-Swing%.
|Hitter||Bats||Calculated Z-Swing%||PITCHf/x Z-Swing%||BIS Z-Swing%||PITCHf/x Delta|
Hitters with Largest Z-Swing% Deltas, Calculated vs PITCHf/x, 2012 (batters with > 500 pitches faced)
To start, this PITCHf/x leaderboard is balanced between left-handed and right-handed hitters. Given that the standard deviation of the PITCHf/x reported Z-Swing% in this sample is 5.5%, the number of hitters with close to that much of a gap is much smaller. Another observation is that some of these hitters are shorter than average, and given that I ignored the height input in my calculation, hitters at the extremes for height should be among those most likely to differ.
We can complete the picture by noting the O-Swing% differences.
|Hitters||Average Absolute Delta||Average Delta|
Average Deltas Between Calculated O-Swing% and Published O-Swing% Metrics, 2012 (batters with > 500 pitches faced)
The comparison with the PITCHf/x data as a whole makes sense. The zone defined by the formulas above assigned 0.1% more of all pitches into the strike zone than the PITCHf/x numbers. With the slightly different zone definition, the formula arrives at a 1.0% higher Z-Swing% rate on average, with an expected complementary 0.8% lower O-Swing%, given that the total number of swings should be constant.
Of course the swing count and total pitch count are only constant if the data set is the same, and also if the definition of a "pitch" per these metrics is the same. I say this because when analyzing the entire picture of BIS data, it defines fewer pitches inside the zone, but reports both a higher average Z-Swing% AND O-Swing% for the hitters in the sample. Sure enough, if I calculate the average Swing% of all hitters in the sample, I get 46.1% for BIS, 45.6% for PITCHf/x and 45.7% for my calculations. So there is something slightly different about the data sets and/or the definition of a swing or a "qualified pitch" in the context of these stats. Ben Jedlovec related that FanGraphs calculates these plate discipline metrics based on raw BIS inputs, so this question remains open.
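The bookkeeping behind this check is that, for a single hitter, overall Swing% is a weighted average of Z-Swing% and O-Swing%, with Zone% as the weight. The sketch below shows that two different zone definitions applied to the same pitches and the same swings reproduce the same Swing%. The hitter and all of the counts are hypothetical.

```python
def swing_pct(zone_pct, z_swing_pct, o_swing_pct):
    """Overall Swing% implied by a zone split (all inputs as fractions)."""
    return zone_pct * z_swing_pct + (1.0 - zone_pct) * o_swing_pct

# Hypothetical hitter: 1,000 pitches, 450 swings, two different zone definitions.
# Definition A puts 500 pitches in the zone, 330 of them swung at.
a = swing_pct(500 / 1000, 330 / 500, 120 / 500)
# Definition B puts 450 pitches in the zone, 310 of them swung at.
b = swing_pct(450 / 1000, 310 / 450, 140 / 550)
# Both come out to 0.45, the hitter's actual swing rate.
```

When two sources disagree on Swing% itself, as in the averages quoted above, the zone definition alone cannot be the cause, since Swing% does not depend on where the zone is drawn.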
In summary, I think that my formulas above have passed the sanity test for defining the strike zone in the 2012 season. As we know the zone has been slowly increasing, in particular by dropping at the bottom, so these numbers may have to be tweaked to be applied to pitches in 2013 or any other season. I have not vetted them against data sets from other years.
In addition, I believe it is fair to say that the BIS and PITCHf/x zone definitions are reasonably different, in particular for left-handed hitters. By using the raw PITCHf/x data to examine the zone, it is my belief that the zone used by PITCHf/x for its plate discipline statistics is closer to the strike zone as it is actually called by umpires today. Of course this belief is based on the assumption that the PITCHf/x data is accurately portraying the zone in a global sense, including that the "lefty strike" is truly present and not an artifact of the calibration or measurement process. Given that I have not seen anybody call into question the validity of this zone for left-handed hitters over the more than five years that the public has been analyzing this data, I believe it to be true.
I suppose this boils down somewhat to a philosophical question. What information do we want to extract from these numbers? How well hitters know the strike zone as it is called, and what their behavior is on pitches inside and outside of this zone? Or is it also interesting, or more interesting, to see how hitters behave when a pitch is over the plate in the rulebook strike zone where they are most likely able to make better contact, regardless of where the called zone really is?
While most of the differences are very small, if one were considering adding someone like Orlando Hudson to a roster for a bench role, it might be interesting to know if his O-Swing% is closer to 28.1% (fairly average) as reported by BIS, or 20.6% (fairly great) as reported by PITCHf/x. My formula says 20.2% on this one.
After going through this exercise, I believe the following about plate discipline statistics. If you want to see how hitters perform against the rulebook strike zone, seek out the BIS set of numbers. If you prefer to analyze how batters fare against the strike zone as it is currently called in Major League Baseball, use the PITCHf/x set of values. If someone on your favorite team popped out with the bases loaded on a questionable pitch and you wonder whether he chased it, you could zip over to Baseball Savant, for example, download the appropriate CSV file of pitches, and plug the px and pz values into the formulas above, knowing that they were accurate for more than 90% of all pitches in 2012. If you absolutely want to squeeze out the last little bit of accuracy based on game state, you can ask me for the numbers, as that process does not lend itself to calculating them yourself.
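For the do-it-yourself route, here is a hedged sketch of running a CSV of pitches through the 2012 formulas. The column names `px`, `pz` and `stand` are assumptions; a real download may label these fields differently, so they are passed in as parameters. The three pitches in the stand-in file are made up.

```python
import csv
import io

# 2012 ellipse parameters from the formulas in this article:
# (center_x, center_z, x_denominator, z_denominator), locations in feet.
ZONE_2012 = {
    "R": (-0.04167, 2.52083, 1.17361, 0.95877),
    "L": (-0.22918, 2.50000, 1.12891, 0.91840),
}

def classify_csv(handle, x_col="px", z_col="pz", stand_col="stand"):
    """Yield (row, in_zone) for each pitch; the column names are assumptions."""
    for row in csv.DictReader(handle):
        cx, cz, a2, b2 = ZONE_2012[row[stand_col]]
        px, pz = float(row[x_col]), float(row[z_col])
        yield row, (px - cx) ** 2 / a2 + (pz - cz) ** 2 / b2 <= 1.0

# Stand-in for a downloaded file (made-up pitches).
demo = io.StringIO("stand,px,pz\nR,0.10,2.40\nL,-1.20,2.50\nR,-1.20,2.50\n")
results = [inside for _, inside in classify_csv(demo)]
```

With a real file, `open("pitches.csv", newline="")` would take the place of the `io.StringIO` stand-in, where the file name is again only a placeholder.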
In the end, if the BIS and PITCHf/x numbers were identical, then they would be telling us the same thing, which would be kind of boring. With an understanding of the way each of these strike zones is measured, we can actually infer two different pieces of information from the reported numbers that may tell us slightly different things about a hitter.
. . .
You can follow me on Twitter at @MLBPlayerAnalys.