For a while, I have wondered whether pitch data could be used to estimate a player's walk and strikeout rates. For each player, Fangraphs.com displays the percentage of pitches swung at and the percentage of swings that make contact, both inside and outside the strike zone (O-Swing%, Z-Swing%, O-Contact%, Z-Contact%). Using multiple regression, I took these 4 variables (outside swing, strike zone swing, outside contact, strike zone contact) and fit them against strikeout and walk percentages.
For this first run, I looked at all the qualified hitters (500 PAs) from 2009. For the strikeout percentage, I got an r-squared of 0.89 and a standard deviation of 2.0% on the difference between the projected and final values. For the walk percentage, I ended up with an r-squared of 0.63 and a standard deviation of 2.0% on the difference between the projected and final values.
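The post does not say which tool ran the regression, so here is only a minimal sketch of how such a fit could be done with NumPy's least-squares solver. The function name `fit_linear_model` is hypothetical, and the data below is synthetic, generated purely to exercise the code. It is not the 2009 Fangraphs data and the numbers it produces mean nothing for real players.

```python
import numpy as np

def fit_linear_model(X, y):
    """Ordinary least squares with an intercept.

    Returns (coefficients, intercept, r_squared, residual_sd).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Append a column of ones so the last fitted value is the intercept.
    A = np.column_stack([X, np.ones(len(X))])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ beta
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r_squared = 1.0 - ss_res / ss_tot
    return beta[:-1], float(beta[-1]), r_squared, float(residuals.std())

# Synthetic stand-in data (NOT real player data): 150 made-up hitters with
# four plate-discipline rates expressed as fractions between 0.2 and 0.9.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.2, 0.9, size=(150, 4))
true_coefs = np.array([-0.05, -0.25, -0.25, -0.90])  # arbitrary demo values
y_demo = X_demo @ true_coefs + 1.3 + rng.normal(0.0, 0.01, size=150)

coefs, intercept, r2, resid_sd = fit_linear_model(X_demo, y_demo)
print("coefficients:", coefs)
print("intercept:", intercept)
print("r-squared:", r2)
print("residual standard deviation:", resid_sd)
```

With real data, `X` would hold one row per qualified hitter and one column per plate-discipline stat, and `y` would hold that hitter's strikeout or walk rate.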
I looked through this dataset and saw that some players had an actual walk rate much higher than projected, by 6% to 8%. These players were all great hitters (Fielder, Pujols, A. Gonzalez), and it dawned on me that IBB was included in the walk rate and I needed to factor it in. I included a fifth variable in the walk calculations, IBB/PA, and re-ran the regression. The results were much better, with an r-squared of 0.79 and a standard deviation of 0.15%. The highest percentage difference was 4% versus 8%. Here are the equations for estimating walk and strikeout rate:
SO% = ((-0.0407 * O-Swing%) + (-0.2417 * Z-Swing%) + (-0.2429 * O-Contact%) + (-0.8765 * Z-Contact%) + 1.2885) * 100%
BB% = ((-0.4134 * O-Swing%) + (-0.0328 * Z-Swing%) + (0.0216 * O-Contact%) + (-0.2595 * Z-Contact%) + (1.7203 * IBB per PA) + 0.4217) * 100%
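For anyone who wants to plug numbers in, the two equations can be sketched in Python roughly as follows. The function names are my own, and I am assuming the inputs are entered as fractions (0.25 for 25%), which is what the intercepts and the final multiplication by 100% suggest. The sample inputs are made up for illustration and do not belong to any real player.

```python
def estimate_so_pct(o_swing, z_swing, o_contact, z_contact):
    """Estimated strikeout rate, in percent.

    Inputs are assumed to be fractions (e.g. 0.25 for 25%).
    """
    return (
        -0.0407 * o_swing
        - 0.2417 * z_swing
        - 0.2429 * o_contact
        - 0.8765 * z_contact
        + 1.2885
    ) * 100


def estimate_bb_pct(o_swing, z_swing, o_contact, z_contact, ibb_per_pa):
    """Estimated walk rate, in percent.

    Inputs are assumed to be fractions; ibb_per_pa is intentional
    walks divided by plate appearances.
    """
    return (
        -0.4134 * o_swing
        - 0.0328 * z_swing
        + 0.0216 * o_contact
        - 0.2595 * z_contact
        + 1.7203 * ibb_per_pa
        + 0.4217
    ) * 100


# Made-up inputs for illustration only (not a real player's line).
so_est = estimate_so_pct(0.25, 0.65, 0.60, 0.88)
bb_est = estimate_bb_pct(0.25, 0.65, 0.60, 0.88, 0.01)
print(f"Estimated SO%: {so_est:.1f}")
print(f"Estimated BB%: {bb_est:.1f}")
```

Subtracting a player's actual rate from the estimate gives the "Estimated – Actual" column used in the tables.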
Using these values, here are the players whose actual rates deviated the most from the estimates and who could be due for a correction in 2010:
|Name||2010 Team||2009 Walk Rate||2009 Estimated Walk Rate||Estimated – Actual|
|Name||2010 Team||2009 Strikeout Rate||2009 Estimated Strikeout Rate||Estimated – Actual|
|Kevin Youkilis||Red Sox||25.5%||21.2%||-4.3%|
I like the initial results, and I am planning to add a few more years' worth of data to get a better equation. I can see this formula being used to check whether changes in walk and strikeout rates are due to changes in plate discipline or are just noise in the data.