Determining Batted Ball Rates using Pitch Type and Location
It is well-established that pitchers have control over their ground ball and fly ball rates--some pitchers, like Roy Halladay, are known for their extreme ground ball tendencies. But what allows these pitchers to achieve a markedly different batted ball profile from the average pitcher? I decided to use Pitch f/x data to determine whether batted ball rates depend on pitch type (as classified by Gameday) and location.
First, I divided the strike zone up into 9 zones of equal area and then added 4 additional zones outside of the strike zone corresponding to inside, outside, high and low pitches. Then I determined the league average batted ball rates for each pitch type in each of the 13 segments for 2008. Using these averages, I calculated each pitcher's expected batted ball rates based on his pitch types and locations. Splitters and knuckleballs were so uncommon that I ran into sample size issues when I divided them among the 13 zones; therefore, I did not incorporate pitch location data for them.
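The zone-and-average procedure above can be sketched in Python. This is only an illustration under stated assumptions: the strike-zone bounds, the function names, and the tuple layout are hypothetical and not taken from the original analysis. The side zones are labeled "left" and "right" rather than inside and outside (which would need batter handedness), corners are assigned to "high" or "low" first, and a pitcher's expected rate is a simple average over his batted balls. Splitters and knuckleballs are not special-cased here.

```python
from collections import defaultdict

# Assumed strike-zone bounds in feet (hypothetical, not from the article).
X_MIN, X_MAX = -0.83, 0.83
Z_MIN, Z_MAX = 1.5, 3.5


def zone(px, pz):
    """Return one of 13 labels: 4 outside zones or a 3x3 cell 'r{row}c{col}'."""
    if pz > Z_MAX:
        return "high"
    if pz < Z_MIN:
        return "low"
    if px < X_MIN:
        return "left"
    if px > X_MAX:
        return "right"
    # Inside the zone: split each axis into thirds (clamp the upper edge).
    col = min(int((px - X_MIN) / (X_MAX - X_MIN) * 3), 2)
    row = min(int((pz - Z_MIN) / (Z_MAX - Z_MIN) * 3), 2)
    return f"r{row}c{col}"


def league_gb_rates(batted_balls):
    """batted_balls: iterable of (pitch_type, px, pz, is_ground_ball)."""
    counts = defaultdict(lambda: [0, 0])  # key -> [ground balls, batted balls]
    for pitch_type, px, pz, is_gb in batted_balls:
        key = (pitch_type, zone(px, pz))
        counts[key][0] += int(is_gb)
        counts[key][1] += 1
    return {key: gb / n for key, (gb, n) in counts.items()}


def expected_gb_rate(pitcher_balls, rates):
    """Average the league GB rate over one pitcher's batted balls."""
    keys = [(pt, zone(px, pz)) for pt, px, pz, _ in pitcher_balls]
    vals = [rates[k] for k in keys if k in rates]
    return sum(vals) / len(vals) if vals else None
```

The pitcher's actual outcomes are ignored in expected_gb_rate on purpose: only his mix of pitch types and locations feeds the estimate.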
The results were somewhat surprising: the correlation between expected ground ball percentage and actual ground ball percentage was low, only 0.449 for all pitchers with 50 or more line drives (LD) allowed (corresponding to about 80 innings pitched). Furthermore, the range of expected ground ball rates was far too narrow.
Predicted GB% ranged from 40% to 50% while actual GB% ranged from 30% to 60%. Even if you regress actual GB% to the mean, some pitchers end up well above 50%; therefore, we can safely conclude that predicted GB% does not correlate well with actual GB%.
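For readers who want to reproduce this kind of comparison, here is a minimal sketch of a Pearson correlation between two lists of rates. The function name and input layout are assumptions; the article does not say which tool computed the 0.449 figure.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)
```

Passing one list of expected GB% and one list of actual GB%, in the same pitcher order, gives the correlation between them.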
I suspect the main reason for this low correlation is limitations in Gameday's pitch classification model. With a better pitch classification algorithm, the correlation would probably increase. Furthermore, my model only considered pitch location and pitch type; velocity and movement probably also play a major role in determining batted ball types. Originally, I thought that pitch type would duplicate pitch movement, since all pitches of a particular type have roughly the same movement; however, it appears that different movements of the same pitch lead to different batted ball results. In conclusion, we cannot determine batted ball types solely based on Gameday's pitch types and pitch location data.
Comments
Good stuff here Alex
I did something very similar with John Lannan a while ago, looking strictly at BABIP
http://www.hardballtimes.com/main/blog_article/whats-john-lannans-secret/
I came to roughly the same conclusion as you. Location by pitch type is simply not enough to predict stats. You need to include specific movement and velocity, count information and pitch sequencing in order to really get somewhere I think. The problem is it’s really freaking hard to include all of those variables, especially when you can’t really use a regression with the pitch f/x coordinates.
You could run a regression if...
you treated the center of the strike zone as the origin and then recalculated the coordinates.
You can manipulate it further if linear regressions don't work.
To check for a logarithmic relationship, you can recalculate the coordinates as powers of e. (You’ll have to hold off the negative values until the end though, because e^-12 is a fraction instead of a large negative value.)
To check for an exponential/geometric relationship, you can run a natural log on the coordinates (the opposite of the above). Again, the negative coordinates would have to be tweaked (made positive before adjustment, then negative again after all calculations).
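The recentering and sign-handling trick described in this comment might look like the sketch below. The zone center (0.0, 2.5 feet) and all function names are assumptions. One deliberate substitution: a plain natural log is undefined at zero and goes negative for magnitudes under 1, so the sketch uses log(1 + |x|) and its inverse exp(|x|) - 1 instead, which keep the sign handling consistent.

```python
import math


def recenter(px, pz, center_x=0.0, center_z=2.5):
    """Shift coordinates so the (assumed) strike-zone center is the origin."""
    return px - center_x, pz - center_z


def signed_transform(value, func):
    """Apply func to the magnitude of value, then restore the original sign."""
    return math.copysign(func(abs(value)), value)


def signed_log(value):
    """Sign-preserving log transform: sign(x) * ln(1 + |x|)."""
    return signed_transform(value, math.log1p)


def signed_exp(value):
    """Sign-preserving exponential transform: sign(x) * (e^|x| - 1)."""
    return signed_transform(value, math.expm1)
```

The two transforms are inverses of each other, so a transformed coordinate can be mapped back to feet after fitting.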
Can't you just break up the data into a lot of bins?
Using all the 2007-2009 data should give you enough data to split it into a few hundred or so bins. You could split by location, velocity, movement, count, and previous pitch.
The problem with using a regression is that the R^2 is going to be very low, because I seriously doubt that the data will show any kind of geometric or polynomial or exponential pattern. While some kind of a smooth function is preferable to bins, a regression will be way off for many values.
by Alex Krolewski on Dec 6, 2009 8:41 PM EST
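A minimal sketch of the multi-way binning discussed in this exchange, assuming each pitch is a dict of numeric fields; the field names, edge lists, and function names are hypothetical. Each bin reports its pitch count next to its average so that thin bins are easy to spot.

```python
from bisect import bisect_right
from collections import defaultdict


def bin_index(value, edges):
    """Index i of the half-open interval [edges[i], edges[i+1]) holding value.

    Returns None when value falls outside the outermost edges.
    """
    if value < edges[0] or value >= edges[-1]:
        return None
    return bisect_right(edges, value) - 1


def bin_pitches(pitches, edges_by_field, outcome_field):
    """Group pitches by a tuple of bin indices; return {key: (mean, count)}."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum of outcome, count]
    for pitch in pitches:
        key = tuple(
            bin_index(pitch[field], edges)
            for field, edges in edges_by_field.items()
        )
        if None in key:
            continue  # skip pitches outside the binned region
        totals[key][0] += pitch[outcome_field]
        totals[key][1] += 1
    return {key: (total / n, n) for key, (total, n) in totals.items()}
```

The key tuple follows the order of edges_by_field, so adding a field such as count or previous pitch type just adds another position to the key.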
I am in the process of binning all of that data
The problem is that there are so many things you need to adjust for (movement, speed, count, runners on base or not, batter hand, pitcher hand, etc.), that some of the bins simply have too few pitches.
Also, it takes a sh—load of work, but I have no doubt it would be worth it.
by vivaelpujols on Dec 6, 2009 10:44 PM EST
I think a Neural Net is the best way to go
I have no idea how to calibrate one, or apply it.
by vivaelpujols on Dec 6, 2009 10:46 PM EST
Really?
It doesn’t take that many variables to identify a pitch – location (adjusted x and y), movement direction, movement magnitude, velocity.
The problem with trying to isolate a pitch’s type-and-location effect on batted balls is that you’d have to account for the rest of the at bat, how the pitcher set up the pitch that was put in play.
This might be a more useful study for looking at hitters than pitchers.
Whoops.
Missed the “count, and previous pitch” note. You’d have to include the previous pitch’s location, movement, and velocity. I don’t think the type alone is enough information. Not sure which you were talking about.
Ideally, you would include every single parameter you could think of
Unfortunately that would result in some 10,000 different bins, which would kill your sample size and make the results practically unworkable. Also, it would probably take several days to code and would drive the coder insane.
Therefore, you need to identify the parameters which are most important. In my opinion, those are the 5 major pitch attributes (velocity, vertical spin deflection, horizontal spin deflection, x location and y location), the count, whether or not there are runners on base (it’s debatable whether this is important or not) and the previous pitch type. Even those might be too many.
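To see how quickly the bin count grows, multiply the number of levels chosen for each parameter. The sketch below does that; the function names are made up, and the level counts in the usage note are placeholders rather than figures from this thread.

```python
import math


def total_bins(levels_per_parameter):
    """Number of distinct bins when every parameter is crossed with every other."""
    return math.prod(levels_per_parameter.values())


def average_pitches_per_bin(n_pitches, levels_per_parameter):
    """Average bin size if pitches were spread evenly, which they are not."""
    return n_pitches / total_bins(levels_per_parameter)
```

For example, five levels for each of the five pitch attributes, 12 ball-strike counts, 2 base states, and 5 previous pitch types would already give 5^5 x 12 x 2 x 5 = 375,000 bins, and real pitches cluster in a small share of them.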
Runners on base should be important
Due to the presence of defensive shifts (1B playing closer to the bag w/a runner on first). I think someone’s done some work on this before (BABIP by base-out state), but I’m not sure…
Anyway if you’re going to use runners on base you probably only need 2 bins—runner on 1B and no runner on 1B.
by Alex Krolewski on Dec 7, 2009 6:14 PM EST
If you're going to add that...
You can also add the type of contact made (line drive, pop-up, ground ball, fly ball) and the hit location. Hit location would require bins, though.
This still won’t account for defensive shifts on hit-and-runs or run-and-hits.
by NoNameOnCard on Dec 8, 2009 12:27 PM EST
You need to pick your battles
Find the 7-8 most important parameters and just organize by those, or else the results will be worthless.
For a stats study, yes.
But if you’re trying to relate it to something in real life, you can’t ignore obvious things like where the defense is playing and how the batter was set up. If you can’t account for those, you’re basically wasting your time.
It’s like learning how physics works in a frictionless vacuum. Ok, great, now how does it really work?
Disagree
When you incorporate too many variables, the sample sizes get too small. Here is an example. Let’s look at one bin, with the following parameters:
*RHP to a RHH
*No runners on
*0-0 count
*Fastball between 91 and 93 MPH
*Vertical spin deflection between 8 and 12 inches
*Horizontal spin deflection between -3 and -7 inches
*Horizontal location between -.2 and .2 feet from the center of the plate
*Vertical location between 2.3 and 2.7 feet off the ground
That’s the bare minimum number of parameters that I can think of, using reasonable bins for movement, speed and location. It’s also a very generic situation: a fastball with average movement and velocity right down the middle, on the most common count and batter/pitcher handedness matchup in baseball.
Over the past 3 years, I get a total of 244 pitches with an average rv100 (run value per 100 pitches, lower is better for the pitcher) of .40. Sounds reasonable, right?
Well, let’s see what happens when we change the speed range to 93-95 MPH. We then get 180 pitches with a 1.36 rv100. That means that, all other things being equal, a 91-93 MPH fastball in that situation is significantly better than a 93-95 MPH one, to the tune of about 1 run per 100 pitches.
Does that make sense at all? Of course not. The only explanation is that our samples are simply too small. We only have 180 pitches in the second bin, which is not nearly a reliable sample.
And that was only using the bare minimum parameters, for the most common pitch in baseball. What happens if you have that same pitch (between 91 and 93 MPH) on a 2-2 count? Well you only get 22 pitches, which is obviously worthless.
You want to add more parameters? The results are already unworkable; adding more parameters would only give you 10 pitches in some bins, and none in others.
We have to think of a better way to do this.
by vivaelpujols on Dec 10, 2009 12:55 AM EST
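The sample-size argument in this comment can be made concrete with a standard error. The sketch below assumes pitches in a bin are independent and share one per-pitch run-value standard deviation, per_pitch_sd, which the reader must estimate from their own data; no value for it appears in this thread, and the function names are hypothetical.

```python
import math


def rv100_standard_error(per_pitch_sd, n_pitches):
    """Standard error of a bin's rv100, given an assumed per-pitch SD."""
    return 100.0 * per_pitch_sd / math.sqrt(n_pitches)


def z_score_of_difference(rv100_a, n_a, rv100_b, n_b, per_pitch_sd):
    """How many standard errors apart two bins' rv100 values are."""
    se_a = rv100_standard_error(per_pitch_sd, n_a)
    se_b = rv100_standard_error(per_pitch_sd, n_b)
    return (rv100_b - rv100_a) / math.hypot(se_a, se_b)
```

Whatever the true SD is, the error shrinks only with the square root of the bin size, so the 22-pitch bin is about sqrt(244 / 22), or roughly 3.3 times, noisier than the 244-pitch one.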
How does Dave Allen produce his heat maps?
Doesn’t he essentially fit a 3-dimensional function to the data in order to get the heat graph? I’m pretty sure it’s not a regression, since the function isn’t linear or parabolic or exponential; he must be using some kind of a smoothing function.
by Alex Krolewski on Dec 10, 2009 1:41 AM EST
He uses a regression surface
Using either lattices (which basically create individual points at various places in the strike zone to use as starting points… it’s like a piecewise function, I think) or a LOWESS regression fit (that’s a polynomial fit that’s particularly adaptive based on the closest points). He then creates a matrix using the predicted values and plots them out with image(x,y,z).
The problem is that neither of those two methods produces a closed-form equation, and like any regression, the more parameters you add, the less effective it is. A LOWESS regression works OK for just px and pz on run value, but if you add data for movement and velocity, I doubt it would work.
You should ask Dave though, he might have thought of a better way to do this.
by vivaelpujols on Dec 10, 2009 2:51 AM EST
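To illustrate the general idea of a smoothed surface with no closed-form equation, here is a sketch of a Gaussian kernel-weighted average over pitch location. This is a simpler local-constant smoother, not LOWESS itself and not necessarily what Dave Allen uses; the function names, data layout, and the 0.25-foot bandwidth are all assumptions.

```python
import math


def kernel_smooth(points, x0, z0, bandwidth=0.25):
    """Gaussian-weighted mean of value at location (x0, z0).

    points: iterable of (px, pz, value) tuples, locations in feet.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for px, pz, value in points:
        dist_sq = (px - x0) ** 2 + (pz - z0) ** 2
        weight = math.exp(-dist_sq / (2.0 * bandwidth ** 2))
        weighted_sum += weight * value
        weight_total += weight
    if weight_total == 0.0:
        return None  # no data close enough to carry any weight
    return weighted_sum / weight_total


def smooth_grid(points, xs, zs, bandwidth=0.25):
    """Matrix of smoothed values (rows follow zs, columns follow xs)."""
    points = list(points)
    return [[kernel_smooth(points, x, z, bandwidth) for x in xs] for z in zs]
```

The matrix from smooth_grid plays the role of the z argument in a heat-map plot. Extending the distance to velocity and movement is mechanically easy, but each added dimension thins out the nearby data, which is the concern raised above.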
This lands into the "wasting your time" category.
Either you don’t have enough of a sample size OR you don’t have enough variables or – more likely – you don’t have enough of a sample AND you don’t have enough variables.
You can’t account for everything simply by increasing the sample size. A fastball right down the middle looks a lot different to a batter after having seen (swing or take) a splitter just under the strike zone.
by NoNameOnCard on Dec 12, 2009 4:54 PM EST
We're not aiming for perfection
Or for exactly how good each pitch in baseball is; we’re just looking to get a sense. By increasing the sample by limiting your parameters to the most important ones, you can get a decent guess at how effective each pitch is.
by vivaelpujols on Dec 12, 2009 7:02 PM EST
Seems like a lot of work...
just to get “a sense” of how good a pitch is.
The set-up pitch absolutely should be accounted for in some way. I think it’s more important than velocity. As such, I think you should widen your velocity segments (84-87, 88-91, 92-95, 96-99) to increase your bin sizes while trying to account for set-up pitches or some other logical necessity.
by NoNameOnCard on Dec 13, 2009 4:24 PM EST
From B-R:
empty .297
1st .312
2nd .289
3rd .300
1st & 2nd .291
1st & 3rd .329
2nd & 3rd .296
Loaded .307
3rd less than 2 out .324
3rd 2 out .293
So I’d say that base-situation is very important, at least for 1st and 3rd and runners on 3rd less than 2 out.
by Alex Krolewski on Dec 7, 2009 7:00 PM EST
