A few weeks ago, I embarked on a difficult sabermetric journey with hopes of possibly discovering factors that would help us predict which characteristics of pitches result in more whiffs.
The major reason behind this journey is that when one breaks down what makes a pitcher successful, the most important component (in almost all cases) is strikeout ability. And when one considers the anatomy of a strikeout, the most prominent component is the pitcher's ability to get batters to swing and miss (whiff).
The main study in the first piece looked at 50 individual pitchers who threw more than 200 four-seam fastballs in 2012.
I pulled data on every four-seam fastball that a batter swung at against each pitcher and labeled the swings that resulted in contact as 0 and the swings that resulted in a whiff as 1.
I tested various predictors, such as the fastball's velocity, movement and location, to see whether we could find out what differentiated the whiffs from the swings that made contact.
This study resulted in this interesting, yet very logical conclusion:
It seems to me, based on the results of this sample, that on average individual pitchers generate more whiffs on fastballs that are higher in the zone relative to their other fastballs.
This result lent credence to the traditionally accepted idea of the high fastball, or "high heat," as a pitch that is more difficult for a batter to catch up to and thus results in more whiffs.
This study was in no way perfect. There is so much going on on a pitch-by-pitch basis that it's really difficult to find any conclusions that hold up under scrutiny. Thus, I decided to revamp the original study to include some quality suggestions I received on how to improve it.
The first was an issue with my choice of model. An emailer, Nick Embrey, pointed out that I should scrap the multiple linear regression model that I used previously in favor of the nonlinear probit model.
If you'd like to learn more about the specifics of a probit model you can follow this link, although I think the emailer explained the idea best when he said:
The main drawback of a linear model when you have a binary dependent variable is that your fitted values can be outside of the [0,1] interval, which doesn't really make sense. This is because what you're really modelling is the probability that the pitch results in a swing and miss or not, and a probability can only be between 0 and 1. A probit model restricts the range appropriately.
Given that the basis of my hypothesis was to find which characteristics of a pitch increase the probability (again, between 0 and 1) of the pitch resulting in a whiff, the probit model made a great deal of sense.
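To make the emailer's point concrete, here is a minimal sketch of how a linear fit can drift outside [0, 1] while a probit fit cannot. The coefficients and pitch heights are made up purely for illustration, and the same coefficients are reused for both models only to keep the comparison simple; a real fit would estimate each model separately.

```python
import math

def normal_cdf(z):
    """Standard normal CDF, the link function a probit model uses."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical coefficients, not estimates from the study.
intercept, slope = -0.2, 0.35

# Hypothetical vertical locations (feet), not real PITCHf/x data.
heights = [0.0, 1.5, 2.5, 3.5, 4.5]

linear_fits = []
probit_fits = []
for h in heights:
    index = intercept + slope * h
    linear_fits.append(index)              # linear model: the index is the fitted value
    probit_fits.append(normal_cdf(index))  # probit: the index is passed through the CDF

for h, lin, pro in zip(heights, linear_fits, probit_fits):
    print(f"height={h:.1f}  linear={lin:+.3f}  probit={pro:.3f}")
```

The linear fitted values at the lowest and highest heights fall below 0 and above 1, while every probit value stays strictly inside the interval.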
The second suggestion was to factor in the sequencing of the pitch. Taking the entire at-bat or pitch sequence into account was honestly way too complex; however, I thought looking at the previous pitch could be fruitful.
If we assume, based on the original study, that a higher-than-usual fastball results in more whiffs, then my hypothesis is that the high fastball becomes more effective when the previous pitch was lower in the zone and slower.
The Study
I took the same sample of pitchers as in the previous study and again classified the swings that resulted in contact as 0 and swings that resulted in a whiff as 1. The independent variables that I used were:
 The velocity of each fastball where there was a swing
 The vertical location of each fastball where there was a swing
 The difference in velocity of the fastball and the previous pitch
 The difference in location of the fastball and the previous pitch
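As a rough sketch of how rows like these could be assembled, the snippet below walks through one at-bat and pairs each swung-at four-seam fastball with the pitch before it. The field names (`pitch_type`, `velocity`, `height`, `swing`, `whiff`), the sample pitches, and the sign convention (fastball minus previous pitch) are all assumptions for illustration; the original study does not spell them out.

```python
def build_fastball_rows(at_bat):
    """Return one row per swung-at four-seam fastball that has a previous pitch.

    `at_bat` is a list of pitch dicts in the order thrown (hypothetical schema).
    """
    rows = []
    # zip pairs each pitch with the one before it, so the first pitch is skipped.
    for previous, current in zip(at_bat, at_bat[1:]):
        if current["pitch_type"] != "FF" or not current["swing"]:
            continue
        rows.append({
            "whiff": 1 if current["whiff"] else 0,   # dependent variable
            "velocity": current["velocity"],
            "height": current["height"],
            # Assumed convention: fastball value minus previous-pitch value.
            "velocity_change": current["velocity"] - previous["velocity"],
            "location_change": current["height"] - previous["height"],
        })
    return rows

# Made-up at-bat: a curveball followed by two four-seam fastballs.
at_bat = [
    {"pitch_type": "CU", "velocity": 78.0, "height": 1.8, "swing": False, "whiff": False},
    {"pitch_type": "FF", "velocity": 93.0, "height": 3.1, "swing": True, "whiff": True},
    {"pitch_type": "FF", "velocity": 92.5, "height": 2.4, "swing": True, "whiff": False},
]

rows = build_fastball_rows(at_bat)
for row in rows:
    print(row)
```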
The Results
The only real issue that I ran into with the probit model is that interpreting the results is slightly more difficult.
The probit model is nonlinear and thus typical linear measures of goodness of fit do not apply. However, a pseudo r-squared can be calculated from the model that is fairly comparable to the typical r-squared from a linear model.
I took the square root of the pseudo r-squared that I found for each player and used that as the quasi-"correlation" or "r" of the model.
This "r" is not the same as one that we would find in a typical correlation, but for all intents and purposes of this piece, we'll consider them to be equivalent.
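For readers who want to see the arithmetic, here is a small sketch of one common pseudo r-squared (McFadden's) and its square root. This is an assumption on my part for illustration: there are several pseudo r-squared definitions, and the outcomes and fitted probabilities below are made up rather than taken from any pitcher in the sample.

```python
import math

def log_likelihood(outcomes, probs):
    """Log-likelihood of 0/1 outcomes given predicted whiff probabilities."""
    return sum(math.log(p) if y == 1 else math.log(1.0 - p)
               for y, p in zip(outcomes, probs))

def mcfadden_pseudo_r2(outcomes, fitted_probs):
    """1 - LL(model) / LL(null), where the null model predicts the base rate."""
    base_rate = sum(outcomes) / len(outcomes)
    ll_model = log_likelihood(outcomes, fitted_probs)
    ll_null = log_likelihood(outcomes, [base_rate] * len(outcomes))
    return 1.0 - ll_model / ll_null

# Hypothetical swings (1 = whiff) and hypothetical fitted probabilities.
outcomes = [1, 0, 0, 1, 0, 0, 0, 1]
fitted = [0.6, 0.2, 0.3, 0.5, 0.1, 0.4, 0.2, 0.7]

pseudo_r2 = mcfadden_pseudo_r2(outcomes, fitted)
quasi_r = math.sqrt(max(pseudo_r2, 0.0))  # guard against a negative value

print(f"pseudo r-squared = {pseudo_r2:.3f}, quasi r = {quasi_r:.3f}")
```

A model that does no better than guessing the overall whiff rate scores 0 on this measure, which is why the square root is only a loose stand-in for a true correlation.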
Below I listed these results for each of the pitchers in the sample:
 The "correlation" or "r" of the model for each pitcher
 Whether or not the vertical location of the fastball was significant at a 95 percent confidence level
 Whether or not the velocity of the fastball was significant at a 95 percent confidence level
 Whether or not the change in velocity of the fastball from the previous pitch was significant at a 95 percent confidence level
 Whether or not the change in vertical location of the fastball from the previous pitch was significant at a 95 percent confidence level
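| Pitcher | "r" | Vert. Location | Velocity | Velocity Change | Location Change |
|---|---|---|---|---|---|
| | 0.417 | No* | Yes | No | No* (Negative) |
| | 0.365 | Yes | No | No | No* (Positive) |
| | 0.353 | Yes | No | No | No* (Negative) |
| | 0.345 | Yes | No | No | No |
| | 0.342 | Yes | No | No | No |
| | 0.314 | Yes | No* | No* (Negative) | No* (Negative) |
| | 0.304 | Yes | No* | No | No |
| | 0.297 | Yes | No | No | No |
| | 0.297 | Yes | No | No | No |
| | 0.296 | Yes | Yes | No | No |
| | 0.288 | Yes | No* | No | No |
| | 0.282 | Yes | No | No | No* (Negative) |
| | 0.281 | Yes | No | No | No |
| | 0.270 | Yes | No* | No | No |
| | 0.268 | Yes | No | No | No |
| | 0.266 | Yes | No | No | No |
| | 0.266 | Yes | No | No | No |
| | 0.266 | No | No | Yes (Positive) | No |
| | 0.262 | Yes | No* | No* (Negative) | No |
| Wei-Yen Chen | 0.261 | Yes | Yes | No | Yes (Negative) |
| | 0.258 | Yes | Yes | No | Yes (Negative) |
| | 0.252 | No | No | No | No |
| | 0.245 | Yes | Yes | No | No |
| | 0.243 | Yes | No | No | No |
| | 0.243 | Yes | No | No* (Positive) | No |
| | 0.241 | Yes | No | No | No |
| | 0.236 | Yes | No | No | No |
| | 0.234 | Yes | Yes | No | No |
| | 0.232 | No | No | No* (Negative) | No |
| Miguel Gonzalez | 0.229 | No* | No* | No | No |
| | 0.223 | No | No | No | No |
| | 0.216 | No | No | No | Yes (Negative) |
| | 0.214 | Yes | No | No | No |
| | 0.214 | Yes | Yes | No | No |
| | 0.199 | No | No | No | No |
| | 0.198 | No | No | No | No |
| | 0.187 | No | Yes | No | No |
| | 0.165 | No* | No | No | No |
| | 0.162 | Yes | No | No | No |
| | 0.160 | Yes | No | No | No |
| | 0.152 | No* | No | No | No |
| | 0.150 | No | No | No* (Negative) | No |
| | 0.146 | Yes | No | No | No |
| | 0.134 | No | No | No | No |
| | 0.134 | No | No | No | No |
| | 0.125 | No | No | No | No |
| | 0.121 | No | No | No | No |
| | 0.110 | No | No | No | No |
| | 0.089 | No | No | No | No |
| | 0.084 | No | No | No | No |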
*indicates that the predictor was significant at a 90 percent confidence level.
It's clear when comparing these measures of goodness of fit to those from the original test that the probit model is more suited for this study, as each "correlation" became stronger.
Vertical Location:
These results backed the "high heat" conclusion that was found in the first test, as the vertical location of the fastball was significant at a 95 percent level for 64 percent of the sample and was significant at a 90 percent confidence level for 72 percent of the sample.
Velocity of the pitch:
These results also backed the original study's conclusion that the velocity of the fastballs that resulted in whiffs was no different from the velocity of those that did not, as velocity was only significant at a 95 percent confidence level for 16 percent of the sample and at a 90 percent confidence level for 28 percent of the sample.
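The percentages in these summaries are simple tallies of the table. As a quick sketch, counting the Velocity column gives 8 "Yes" entries and 6 "No*" entries among the 50 pitchers; the function name below is my own, and the 90 percent figure counts both groups together.

```python
def share_significant(yes_count, starred_count, sample_size):
    """Percent of the sample significant at the 95 and 90 percent levels.

    Anything significant at 95 percent is also significant at 90 percent,
    so the 90 percent share adds the starred entries to the "Yes" entries.
    """
    at_95 = 100.0 * yes_count / sample_size
    at_90 = 100.0 * (yes_count + starred_count) / sample_size
    return at_95, at_90

# Counts read off the Velocity column of the results table.
velocity_95, velocity_90 = share_significant(8, 6, 50)
print(f"Velocity: {velocity_95:.0f}% at 95 percent, {velocity_90:.0f}% at 90 percent")
```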
Change in location from the previous pitch:
I was surprised to find that change in location was not a significant predictor for the majority of this sample: it was significant at the 95 percent level for 6 percent of the sample and at the 90 percent level for 16 percent. I was even more surprised to find that, for most of the few pitchers with a significant relationship, the relationship between the change in location and the probability of a whiff was, in fact, negative.
This would indicate that the further away the previous pitch was from the fastball, in terms of vertical location, the less likely it was that a whiff would occur on the fastball.
Change in velocity from the previous pitch:
As with the change in location from the previous pitch, the change in velocity was not a significant predictor for the majority of the sample; it was significant at a 95 percent confidence level for only one pitcher in the sample and at a 90 percent confidence level for only 12 percent of the sample.
Interestingly enough, the majority of those significant relationships were also negative, which meant that the larger the gap between the velocity of the previous pitch and the velocity of the fastball, the less likely it was that a whiff would occur.
The results for velocity and location change, which were my attempt to take pitch sequencing into account, were the exact opposite of what my hypothesis predicted. I had expected that a larger difference in velocity and location would indicate a greater probability of a whiff on a fastball.
Based on this more extensive test it seems the only real conclusion that I could make on what could possibly increase the probability of a fastball resulting in a whiff is to elevate the fastball relative to others; the "high heat" assumption.
How much does elevating the fastball increase the probability of a whiff?
I'll use Zach McAllister of the Cleveland Indians as an example, as he had the strongest relationship between vertical location and the probability of a whiff in this sample.
According to the probit model, the marginal effect of vertical location for McAllister is 20.6 percent; that is, a small increase in the height of his fastball raises the probability of a whiff at a rate of 20.6 percentage points per unit of vertical location. I personally think this is a fairly large increase, but keep in mind that McAllister's vertical location was the strongest predictor in this sample.
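For anyone curious where a marginal effect comes from, here is a minimal sketch for a one-predictor probit: the effect is the standard normal density at the fitted index multiplied by the coefficient. The intercept, slope, and height below are hypothetical and are not McAllister's estimates; note also that the effect depends on the point at which it is evaluated, which this piece does not specify.

```python
import math

def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    """Standard normal CDF (the probit link)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def whiff_probability(height, intercept, slope):
    return normal_cdf(intercept + slope * height)

def marginal_effect(height, intercept, slope):
    """Derivative of the whiff probability with respect to height."""
    return normal_pdf(intercept + slope * height) * slope

# Hypothetical coefficients and evaluation point, for illustration only.
intercept, slope, height = -2.5, 0.6, 2.5

effect = marginal_effect(height, intercept, slope)

# Sanity check against a central finite difference.
step = 1e-5
numeric = (whiff_probability(height + step, intercept, slope)
           - whiff_probability(height - step, intercept, slope)) / (2.0 * step)

print(f"analytic = {effect:.4f}, numeric = {numeric:.4f}")
```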
Overall the explanatory strength of the predictors (even the significant ones) in this sample was fairly weak. But, again, we should not expect a lot of explanatory strength when analyzing something on a pitchbypitch basis.
My goal with this series is not only to look at four-seam fastballs, but also to see what we can learn about what may explain a whiff for other pitch types. Thus, I posted a poll below where you can vote on which pitch type you'd like to see studied next.
All data comes from the PITCHf/x database available on Baseball Heat Maps.
You can follow Glenn on Twitter @Glenn_DuPaul.
Objective analysis of all things baseball at Beyond The Box Score