Other People's Work
Like most things in the sabermetric world, the work in this article builds on the efforts of other researchers. I recommend Dan Turkenkopf, Jonathan Hale, John Walsh, and Dave Allen for background on framing and what causes ball and strike calls in general. I know I've missed good work, but if you've found your way here, you know the right places to look anyway.
Building the Metric
Step 1: Data Set
Included in the data set are all pitches from MLB Gameday
1) With pitchFX data
2) With des type "Called Strike" or "Ball" (NOT "Intentional Ball" or "Ball in Dirt")
3) With pz (vertical height when crossing the front of the plate) greater than 0. In theory all pitches with pz < 0 should be classified as "Ball in Dirt" but in practice they are not.
4) From 2008 and 2009. Each year is treated as a separate data set. 2007 is excluded.
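The filtering criteria above can be sketched as a simple pass over pitch records. The `des` and `pz` field names come from pitchFX; the records themselves are made up for illustration.

```python
# Hypothetical pitch records; des and pz are pitchFX field names,
# the values are illustrative.
pitches = [
    {"des": "Called Strike",    "pz": 2.5,  "year": 2008},
    {"des": "Ball",             "pz": 1.1,  "year": 2009},
    {"des": "Ball in Dirt",     "pz": -0.3, "year": 2008},
    {"des": "Intentional Ball", "pz": 3.0,  "year": 2009},
    {"des": "Ball",             "pz": -0.2, "year": 2008},  # mislabeled dirt ball
]

called = [
    p for p in pitches
    if p["des"] in ("Called Strike", "Ball")  # only called pitches
    and p["pz"] > 0                           # drop sub-zero heights
    and p["year"] in (2008, 2009)             # 2007 excluded
]
```

Note that the last record is a "Ball" but survives no filter: its `pz` is below zero, catching exactly the mislabeled-dirt-ball case described in criterion 3.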
Step 2: Normalize pz
Batters have very different vertical strike zones, and in order to create a model everything needs to be put on the same scale. I used the method Tango details here.
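One plausible normalization is a linear rescaling of each batter's reported zone onto a common reference zone. This is only a sketch of the idea; Tango's actual recipe differs in its details, and the reference-zone endpoints below are assumed values, not league figures.

```python
# Assumed reference zone endpoints (feet); not official league values.
REF_BOT, REF_TOP = 1.5, 3.5

def normalize_pz(pz, sz_bot, sz_top):
    """Linearly map a pitch height from the batter's reported zone
    (sz_bot..sz_top, pitchFX fields) onto the reference zone."""
    frac = (pz - sz_bot) / (sz_top - sz_bot)  # position within batter's zone
    return REF_BOT + frac * (REF_TOP - REF_BOT)

# A pitch at the top of a short batter's zone maps to the reference top.
print(normalize_pz(3.2, 1.4, 3.2))  # -> 3.5
```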
Step 3: Divide Into Bins by Count and Batter Handedness
If you've read the links in the introduction, you'll be aware that these two characteristics have a profound impact on called strike zone size. Each combination of count and batter handedness is treated as a unique strike zone.
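The binning step amounts to grouping on a (balls, strikes, handedness) key; with 12 counts and 2 handednesses there are up to 24 bins. Field names here are illustrative.

```python
from collections import defaultdict

# Bin called pitches by count and batter handedness; each
# (balls, strikes, stand) key gets its own strike-zone model.
pitches = [
    {"balls": 0, "strikes": 0, "stand": "R", "pz": 2.5},
    {"balls": 3, "strikes": 0, "stand": "L", "pz": 1.8},
    {"balls": 0, "strikes": 0, "stand": "R", "pz": 3.1},
]

bins = defaultdict(list)
for p in pitches:
    bins[(p["balls"], p["strikes"], p["stand"])].append(p)
```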
Step 4: Use Local Likelihood Regression to Estimate Probability of a Pitch being Called a Strike
For each of the bins created in step 3, a strike zone is constructed using a local likelihood model. For any pitch, given its vertical and horizontal location as it crosses the plate, current count, and batter handedness, a probability that it is called a strike is assigned.
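A minimal stand-in for the local likelihood model is a kernel-weighted share of strikes near the query location, i.e. a degree-zero (local-constant) fit rather than the local logistic fit the full model would use. The bandwidth and sample pitches below are illustrative assumptions.

```python
import math

def local_strike_prob(px, pz, pitches, bandwidth=0.3):
    """Kernel-weighted fraction of strikes near (px, pz): a degree-zero
    sketch of the local likelihood estimate for one count/handedness bin."""
    num = den = 0.0
    for p in pitches:
        d2 = (p["px"] - px) ** 2 + (p["pz"] - pz) ** 2
        w = math.exp(-d2 / (2 * bandwidth ** 2))  # Gaussian kernel weight
        num += w * p["strike"]                    # strike = 1, ball = 0
        den += w
    return num / den

# Toy data for one bin: strikes near the middle, balls off the edges.
sample = [
    {"px": 0.0,  "pz": 2.5, "strike": 1},
    {"px": 0.1,  "pz": 2.4, "strike": 1},
    {"px": 1.5,  "pz": 2.5, "strike": 0},
    {"px": -1.5, "pz": 2.5, "strike": 0},
]
```

A pitch down the middle gets a probability near 1, and one well off the plate gets a probability near 0, as expected.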
Step 5: Calculate Extra Strikes for Each Pitch
For each pitch, the difference between the actual call (1 for a strike, 0 for a ball) and the modeled probability of a strike is calculated, with the difference being the extra strikes, above or below league average, generated on that particular pitch. For example, a pitch with a strike probability of 0.6 that was actually called a strike would be 0.4 extra strikes.
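The calculation in this step is a one-liner, shown here with the 0.6/0.4 example from the text:

```python
def extra_strikes(called_strike, strike_prob):
    """Actual call (1 for strike, 0 for ball) minus modeled strike probability."""
    return (1 if called_strike else 0) - strike_prob

print(extra_strikes(True, 0.6))   # -> 0.4 (the example above)
print(extra_strikes(False, 0.6))  # -> -0.6
```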
Step 6: Umpire Adjust
Umpires' zones can differ greatly. For each umpire, an average number of extra strikes per pitch is calculated, and applied to all pitches that umpire was behind the plate for. For a pitch with 0.4 extra strikes after step 5, and called by an umpire with 0.05 extra strikes per pitch, the adjusted extra strikes would be 0.35 extra strikes.
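The adjustment subtracts each umpire's per-pitch average from every pitch he called. Umpire names and rates below are made up for illustration; the first pitch reproduces the 0.4 − 0.05 example from the text.

```python
# Per-pitch extra strikes from step 5, tagged with the umpire; illustrative.
pitch_calls = [
    {"umpire": "A", "extra": 0.4},
    {"umpire": "A", "extra": -0.1},
    {"umpire": "B", "extra": 0.2},
]
ump_avg = {"A": 0.05, "B": -0.02}  # extra strikes per pitch, by umpire

for p in pitch_calls:
    p["adj_extra"] = p["extra"] - ump_avg[p["umpire"]]

print(round(pitch_calls[0]["adj_extra"], 2))  # -> 0.35, matching the example
```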
Step 7: Roll up by Catcher, Pitcher, or Pitcher-Catcher Combination
Rolling up by catcher ignores any effects from the specific pitchers they caught, and rolling up by pitchers ignores the effects from their catchers. As of now, the metric has no mechanism for making these adjustments. Total extra strikes and extra strikes per called pitch give a total value and rate version of the metric. Extra strikes per called pitch (ES/CP) is particularly convenient, as it is approximately equal to wins per game caught. (Average of ~75 called pitches per team, per game * ~0.13 runs per extra strike, as calculated by Dan Turkenkopf * 0.1 win per run = ~ 1)
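The roll-up and the rate-to-wins conversion can be sketched as follows. Catcher names and per-pitch values are hypothetical; the conversion constants (~75 called pitches per team-game, ~0.13 runs per extra strike, ~0.1 wins per run) are the ones quoted above.

```python
from collections import defaultdict

# Adjusted extra strikes from step 6, tagged with the catcher; illustrative.
calls = [
    {"catcher": "Smith", "adj_extra": 0.35},
    {"catcher": "Smith", "adj_extra": -0.10},
    {"catcher": "Jones", "adj_extra": 0.05},
]

totals = defaultdict(lambda: {"extra": 0.0, "pitches": 0})
for c in calls:
    t = totals[c["catcher"]]
    t["extra"] += c["adj_extra"]
    t["pitches"] += 1

es_cp = {k: v["extra"] / v["pitches"] for k, v in totals.items()}

# 75 * 0.13 * 0.1 = 0.975 ~= 1, which is why ES/CP approximates
# wins per game caught.
wins_per_game = {k: r * 75 * 0.13 * 0.1 for k, r in es_cp.items()}
```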
The results for each pitcher, catcher, and battery are available on Google Docs here.
The results for catchers ranged from -0.058 ES/CP (Ryan Doumit, 2008) to 0.053 ES/CP (David Ross, 2009). This is equivalent to -7 to 6.3 wins per 120 games caught, or -0.54 to 0.49 ERA (assuming 92% of runs are earned). This is a HUGE number.
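As a sanity check, the Doumit figure can be pushed through the conversion constants quoted earlier. This reproduces roughly -7 wins and an ERA swing near -0.54; the small gap on the ERA side appears to come from rounding the wins figure to -7 before converting.

```python
# Sanity-check the conversions for the extreme 2008 catcher season.
ES_CP = -0.058            # Ryan Doumit, 2008
PITCHES_PER_GAME = 75     # called pitches per team-game
RUNS_PER_STRIKE = 0.13    # Turkenkopf's run value of an extra strike
WINS_PER_RUN = 0.1
GAMES = 120

runs = ES_CP * PITCHES_PER_GAME * RUNS_PER_STRIKE * GAMES
wins = runs * WINS_PER_RUN
era_delta = runs * 0.92 * 9 / (GAMES * 9)  # 92% earned, 9 innings per game

print(round(wins, 1))       # -> -6.8, i.e. roughly -7 wins
print(round(era_delta, 2))  # -> -0.52, near the quoted -0.54
```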
Year-to-year consistency is considerable. I broke both pitchers and catchers into quartiles (as well as a top half for catchers) based on average called pitches per year, 2008-9, and calculated the correlation coefficient and its standard error for each. Then, using the formula r = X/(X+c), I calculated the number of called pitches necessary to produce r = 0.5. I performed the same calculation for r + 2 standard errors and r - 2 standard errors to construct a 95% confidence interval on c.
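Solving the reliability formula r = X/(X+c) for c gives the sample size at which reliability reaches 0.5. The example inputs below are illustrative, not the study's actual results.

```python
# With observed correlation r at sample size X, solving r = X / (X + c)
# gives c = X * (1 - r) / r: the called-pitch count where reliability
# reaches 0.5.
def pitches_for_half_reliability(r, sample_size):
    return sample_size * (1 - r) / r

# Illustrative: if r = 0.7 were observed at 3000 called pitches...
print(pitches_for_half_reliability(0.7, 3000))  # ~1286 called pitches
```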
| r | SE[r] | Sample Size (Called Pitches) | SS for r=0.5, mean | SS for r=0.5, high | SS for r=0.5, low |
Both pitchers and catchers show a large amount of year-to-year consistency - some of this is expected to be shared, since many batteries work together in both years. As an attempt to separate the effects, I took catchers from the top 3 quartiles and pitchers from the top quartile (similar numbers of players) who played for different teams in 2008 and 2009. The results were:
| r | SE[r] | Sample Size (Called Pitches) | r=0.5, mean | r=0.5, high | r=0.5, low |
Conclusion: I still don't know how to break apart pitcher and catcher contribution to ES/CP. But, it definitely matters.
I am aware there is still much to be improved here.
...what about movement?
I'm sure it matters, but as an improvement to accuracy rather than a core component. Dave Allen created a model of strike likelihood which included the break parameter. While he found it to be statistically significant, the largest difference in break contributed about the same as an inch of position. I was unwilling to include this until I had a framework I was happy with, as some work would have to be done to determine whether to use break, pfx_x and pfx_z, or something else.
...shouldn't the umpire adjustment be part of model construction?
Yes, ideally, it should. As detailed above, the adjustment assumes that all umpires' zones are the same shape, just different sizes, and that pitches are distributed the same way for every pitcher and catcher. Neither is true. But including umpire identity as another set of bins would cause crippling sample size issues.
...those effect sizes are too big! I don't believe them.
I didn't believe them at first either, and I'm still a bit skeptical. But this makes three studies (Dan Turkenkopf's plus my original attempt at this using the rulebook zone) which all have large effect sizes. If you know what is wrong, I'm all ears.
...how are you going to verify these are "real"?
Good question. I'm not sure yet. My best idea is to take all pitchers whose primary catcher in 2010 differs from the one they had in 2007-2009. Then, take those pitchers' Marcel, ZiPS, and CHONE ERA projections for 2010 and adjust them according to the difference in ES/CP between the old and new catchers. At the end of 2010, compare the original forecasts to the adjusted forecasts, and see how often, and by how much, the adjustment improved the ERA estimates.
...I've got some other issue with this.
Tell me what it is! This will only get better if it has holes poked in it.