@MacAree: @BtB_Sky @dturkenk @sabometrics Someone throw up a comment thread on BTB so we can have this discussion less confusingly?
Ok, Graham, here's your discussion thread. For those who don't stalk Graham on Twitter, these guys have been discussing the merits and deficiencies of potential Hit f/x data for judging hitters and fielders. 140 characters wasn't really enough. Let's see what happens next...
about 1 month ago
Sky Kalkman
89 comments
Comments
Probably not the discussion you were looking for, but as an aside
Is hit f/x data publicly available anywhere?
Ok, let's start
Hit f/x gives us the initial vector of a batted ball, in terms of velocity, field angle, and elevation angle. Will that be enough to get us accurate batted ball classifications?
How granular are we talking? The current FB/LD/GB schema? How many buckets are we aiming for?
And are we only considering height? Or will horizontal angle become part of a popular batted ball classification? (LDs and FBs have different angles for success than GBs. i.e. fielder gaps are different.)
As a thought, why even aim for a number of buckets?
Why not do the calculation for 1 through n buckets, where n is the number of batted balls that there’s data for, and then see which number of buckets optimizes predictive values?
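One way to sketch that idea in Python, assuming we only bucket on launch angle and score each bucket count by how well it predicts outs on held-out balls. The function names, the equal-width buckets, and the synthetic data below are all made up for illustration; real Hit f/x data would replace them.

```python
import random

def bucket_index(x, lo, hi, k):
    """Map x in [lo, hi] to one of k equal-width buckets."""
    if hi == lo:
        return 0
    i = int((x - lo) / (hi - lo) * k)
    return min(max(i, 0), k - 1)

def cv_error(xs, ys, k, folds=5):
    """Cross-validated squared error of per-bucket out rates for k buckets."""
    lo, hi = min(xs), max(xs)
    n = len(xs)
    total = 0.0
    for f in range(folds):
        train = [i for i in range(n) if i % folds != f]
        test = [i for i in range(n) if i % folds == f]
        overall = sum(ys[i] for i in train) / len(train)
        sums, counts = [0.0] * k, [0] * k
        for i in train:
            b = bucket_index(xs[i], lo, hi, k)
            sums[b] += ys[i]
            counts[b] += 1
        for i in test:
            b = bucket_index(xs[i], lo, hi, k)
            # A bucket with no training data falls back to the overall out rate.
            pred = sums[b] / counts[b] if counts[b] else overall
            total += (ys[i] - pred) ** 2
    return total / n

def best_bucket_count(xs, ys, candidates, folds=5):
    """Return the candidate bucket count with the lowest held-out error."""
    return min(candidates, key=lambda k: cv_error(xs, ys, k, folds))

# Synthetic illustration only: invented launch angles and out probabilities.
rng = random.Random(0)
angles = [rng.uniform(-20.0, 60.0) for _ in range(1000)]
outs = [1 if rng.random() < (0.85 if a < 5 or a > 30 else 0.25) else 0
        for a in angles]
candidates = [1, 2, 4, 8, 16]
errors = {k: cv_error(angles, outs, k) for k in candidates}
best = best_bucket_count(angles, outs, candidates)
```

Nothing stops `candidates` from running all the way from 1 to n; it just gets slow, and very fine buckets end up mostly empty and fall back to the overall rate.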
That's true.
Just thinking about vertical buckets (like we use now), there are certainly going to be parts of the continuum where changes don’t matter and then parts where small changes make a large difference in out rate.
Take grounders as an example: REALLY soft is an easy out for the catcher. A little harder is a good chance of an infield hit. Harder is a routine grounder. Harder still will get through holes/cause errors more often… etc.
My belief is that what needs to happen is to take into account the seven parameters of batted ball trajectory
(vector is three, then spin is another three, then atmospherics) and run a clustering analysis to figure out the optimal number of buckets we’d want to look at. After that, you can take any batted ball you like and do a fuzzy means to figure out what buckets it should be shared between. It’s a quasi-continuous solution.
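A minimal sketch of the "share a ball between buckets" step, assuming the bucket centers have already been found by some clustering run. It uses the standard fuzzy c-means membership formula; the two-feature centers and the fuzzifier `m=2.0` are placeholders, and in practice each parameter would need to be scaled so that no single one dominates the distance.

```python
import math

def fuzzy_memberships(point, centers, m=2.0):
    """Fuzzy c-means style membership of one point in each cluster.

    Memberships are non-negative and sum to 1; m > 1 controls how soft
    the assignment is (values near 1 approach a hard assignment).
    """
    dists = [math.dist(point, c) for c in centers]
    # A point sitting exactly on a center belongs entirely to that cluster.
    for j, d in enumerate(dists):
        if d == 0.0:
            return [1.0 if i == j else 0.0 for i in range(len(centers))]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((dj / dk) ** power for dk in dists) for dj in dists]

# Hypothetical, already-standardized bucket centers in a 2-D slice of the space.
centers = [(-1.0, 0.0), (1.0, 0.0), (0.0, 3.0)]
shares = fuzzy_memberships((0.5, 0.2), centers)
```

Here `shares` would be the fractions of that one batted ball credited to each bucket, which is what makes the result quasi-continuous rather than a hard GB/LD/FB label.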
Oh, and Dan mentioned that pitch location would be important too
Personally, I don’t see it – there’s a very limited range where the ball can actually be hit, and ballistics will tell you that the starting point of an object inside a 2×2 cube won’t matter very much over baseball distances. I don’t see leaving it out as a huge deal compared to the gains from ignoring it when we’re computing outs and runs
It depends
whether the different outcomes of say a good curveball vs a hanging curveball can be captured completely with the other data we’d be using. Seems like a good chance of that happening, especially if we’re using landing location
I don't follow
Apart from location, the batted ball trajectory is ignorant of the pitch that came before, correct?
Will spin on the pitch
have an effect on the spin of the batted ball? I guess that would just be captured in the batted ball spin we are trying to calculate, and thus it would be unnecessary to include?
Including data about the pitch might help us assign ‘credit’ to the pitcher for a bad pitch vs the hitter for hitting a good pitch or something – which should matter at some level, right?
I'm (fairly) sure pitch spin would affect batted ball spin
But wouldn’t change how the ball flies once contact is made and the spin for that is captured.
I agree with you that if we’re looking to apportion credit, we’ll want information about how difficult the pitch is to hit too.
by Dan Turkenkopf on Jul 21, 2010 4:49 PM EDT up reply actions
Spin
Craig, take a tennis racket and hit a slice or a top spin shot. Same effect here. Spin is extremely important.
Remember: baseball guys... baseball...
Craig was asking if you know the spin off the bat, would the spin of the pitch have an effect beyond that.
I don’t think so, since any possible effects are being measured by the 6 parameters.
In other words, does knowing how the ball was hit lack any information that knowing how the ball was pitched would tell us?
by Sky Kalkman on Jul 22, 2010 10:28 AM EDT up reply actions
Not...
Not if we’re concerned solely with what happens with the ball once it’s hit – If we’re looking to judge fielders, essentially.
If we’re looking to judge hitters or pitchers, then we have to integrate info about the pitch… And not for judging fielders, unless we find they do something differently according to the pitcher on the mound. (I’m thinking of something similar to the more arcane UZR adjustments..)
Go Twins!
Right, you want to know about the pitch in order to judge what the hitter does with it.
But given that you know how the ball was hit, knowing about the inputs doesn’t tell you any more about the outputs.
Well
does the spin data also give you the spin acceleration/deceleration? If not, the pitch data could help you guess that. That’s the only application I could see, though.
Are you talking about spin-down
i.e., the decay of the spin rate throughout the flight of a fly ball? If so, its effect is believed to be negligible:
http://webusers.npl.illinois.edu/~a-nathan/pob/spindown.pdf
Winner, Beyond the Box Score 32 Predictions Contest, 2009
Yeah, I was
This article looks interesting, I’ll read it later, thanks.
If that’s the case, then I see nothing that pitch f/x should give you once the ball’s been hit about spin.
I'm not sure
Since I’ve forgotten how to calculate trajectories I’m using some online calculators.
It appears that the difference between a pitch hit 2 feet off the ground versus one hit 4 feet off the ground can range from just a few feet to almost 20 feet (all else equal of course). The higher the angle, the less of a difference it makes.
by Dan Turkenkopf on Jul 21, 2010 8:03 PM EDT up reply actions
Upon further review, I'm not sure the calculator I found was correct
It had an object with an initial velocity of 80 MPH at a 45 deg angle going 900 feet.
But the 20 ft case was roughly 230-250 ft I think. I’ll see if I can reproduce it with numbers that look a little closer to my expectations. Or god forbid, do my own math.
by Dan Turkenkopf on Jul 22, 2010 4:49 PM EDT up reply actions
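For a quick sanity check on calculators like that one, here is a drag-free (vacuum) range calculation in Python. It deliberately ignores air resistance and spin, so it overstates real carry and is no substitute for a full trajectory model. Even so, it puts an 80 MPH ball at 45 degrees at roughly 428 feet, so 900 feet cannot be right. The function name and the contact heights are just for illustration.

```python
import math

G_FT_S2 = 32.174            # gravitational acceleration in ft/s^2
MPH_TO_FT_S = 5280 / 3600   # one mile per hour expressed in ft/s

def vacuum_range_ft(speed_mph, angle_deg, height_ft=0.0):
    """Horizontal distance travelled before hitting the ground, with no drag."""
    v = speed_mph * MPH_TO_FT_S
    theta = math.radians(angle_deg)
    vx, vy = v * math.cos(theta), v * math.sin(theta)
    # Flight time is the positive root of height + vy*t - g*t^2/2 = 0.
    t = (vy + math.sqrt(vy * vy + 2 * G_FT_S2 * height_ft)) / G_FT_S2
    return vx * t

def contact_height_gain_ft(speed_mph, angle_deg, low_ft=2.0, high_ft=4.0):
    """Extra distance from making contact at high_ft instead of low_ft."""
    return (vacuum_range_ft(speed_mph, angle_deg, high_ft)
            - vacuum_range_ft(speed_mph, angle_deg, low_ft))

for angle in (5, 15, 30, 45):
    print(angle, round(contact_height_gain_ft(80, angle), 1))
```

In this simplified model the 2-foot vs 4-foot difference shrinks as the launch angle rises, which agrees with the pattern described above.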
Alan Nathan's trajectory calculator spreadsheet is a good resource for this stuff
http://webusers.npl.illinois.edu/~a-nathan/pob/Full-3d-trajectory-7.xls
Why clustering?
Could you expand on what you mean by this?
Essentially, we're looking at a 7-D matrix of all these batted ball parameters
We might find that there are patterns within the data that lend themselves to being seeds for our buckets, and there are algorithms you could run across the whole data set in order to identify how many seeds would be optimal in terms of group separation and accuracy. The last time I looked at this sort of thing was three or four years ago, though, so I don’t remember exactly how it was done.
I see
The only clustering I know of is k-means clustering. But that doesn’t tell you the optimal number of clusters unless you do some sort of cross-validation.
Okay, but
then it’s really the validation part that we’re more concerned with, no? I mean, that’s what’s going to decide what value of k is used, isn’t it?
Sure, but I don't think that that's a particularly difficult thing to do
It’s computationally intensive, but we’d only have to do it once for all of batted-ball space. I forget the technical definitions of ‘good’ clusters vs ‘bad’, but I know that there are ways of doing it.
Oh, I agree completely, it's not hard
but at that point, what’s the advantage of using k-means clustering? Nothing’s going to give you the advantage of skipping the validation, so most regression/learning methods will work. Some will probably work better.
I also forget how you evaluate the quality of a cluster; I’ll have to look it up. I think it has to do with how likely another test point is to fall in a cluster, and the tightness of that cluster…I don’t know, I have to look it up.
After looking it up
The goal is to get as many “similar” clusters as possible…nothing is inherently good or bad about a cluster on its own, it’s a total similarity function of the entire set.
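One common whole-set measure of this kind is the mean silhouette score, sketched below in plain Python. It is offered as an example of a "total similarity function", not necessarily the one looked up above. Each point is scored by how much closer it sits to its own cluster than to the nearest other cluster, and the scores are averaged over the entire set.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette over all points: near 1 is tight and well separated,
    near 0 is overlapping, negative suggests points are in the wrong cluster."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for a one-point cluster
            continue
        # a: mean distance to the other members of p's own cluster
        # (p itself contributes a distance of zero to the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy points: two obvious groups, labelled sensibly and then badly.
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(pts, [0, 0, 1, 1])
bad = silhouette_score(pts, [0, 1, 0, 1])
```

Running the clustering for several values of k and keeping the k with the highest score is one way to pick the number of buckets, though it still costs one full clustering run per candidate.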
As much as I think GB/LD/FB is important now, I also think that they're something we should be looking to get rid of
We don’t know how many buckets we should be using, the exact definition of the buckets, etc. Seems to imply that we should be moving towards a continuous analysis. Furthermore, Hit f/x won’t solve the problem anyway, because it’s missing some of the batted ball parameters (particularly spin) which might impact run/out value significantly.
Right, so you pointed out that landing location is a good replacement for spin (because you can calculate spin knowing it).
But unfortunately
we won’t have trustworthy landing locations until Field F/X is around and that is going to take quite a while. Hit f/x will definitely be around much earlier
You gotta think landing location is one of those things that you could do reasonably well without technology, though.
Not that someone doing it would give it away, I guess.
Can't bank on
Field F/X (or Hit F/X for that matter) being given away either.
I don’t know how much I trust current landing location data or whose is the best. For HRs I think Hit Tracker is generally considered tops but for other balls in play?
Most analysis of landing location is done by video
Which is currently limited to the information captured by TV cameras. There are often no real clues as to the exact position of the ball.
Might not be an issue for analyzing batting, but probably would be for fielding.
That’s actually an important question to answer here – what are we trying to analyze? And if we optimize for one part of the game, do we improve or hurt our understanding of the others?
by Dan Turkenkopf on Jul 21, 2010 4:28 PM EDT up reply actions
Not to speak for Graham (ok, to speak for Graham), but continuous would be best
No buckets at all
Personally, I think we’re going to bucket at some point – so we’ll want consistent definitions. I need to think more about themastah’s suggestion.
by Dan Turkenkopf on Jul 21, 2010 4:19 PM EDT up reply actions
I haven't read the whole discussion yet, but I thought I should post this
Mike Fast showed a graph or two on Tango’s blog
http://www.insidethebook.com/ee/index.php/site/comments/launch_angle_speed_off_the_bat_trajectory/#2
It seems as though there is way too much overlap to separate things into the 4 batted ball classifications. I think a continuous model using LOESS or something similar would be better.
by vivaelpujols on Jul 21, 2010 6:07 PM EDT up reply actions
What applications are being considered?
I feel one of the major uses would be a replacement for wOBA. A more refined tRA/tRA*/tRAr?
Right now we're pretty much outcome-based or scouting-based when trying to detect changes for batters
If the 7 params Graham mentioned change beyond some reasonable fluctuation, we can infer a change in approach/talent/health I think.
by Dan Turkenkopf on Jul 21, 2010 4:35 PM EDT up reply actions
Hmmm
Possibly…but I’m worried (without looking at the numbers) that there will be so much noise on this level that it will be difficult to tell.
Understandable
But it’s probably worth trying.
by Dan Turkenkopf on Jul 21, 2010 4:38 PM EDT up reply actions
It's everything we try to do right now with limited data.
For example, we compare this year’s and last year’s batted ball locations to see if someone isn’t hitting the ball as hard as they were last year. Now:

With Hit f/x: measure batted ball velocity directly.
O.o
That you know that creeps me out.
It was posted on Lookout Landing.
But of course Graham knew it intuitively too.
M's fan in PA, soon to be LA
by perfectstrat on Jul 22, 2010 1:42 PM EDT up reply actions
I prefer to believe
You’re psychic.
I hope you can accept this.
Huh?
I thought the point of tRA is to be fielding indifferent.
If you know the difficulty of batted balls, then, well, you know how difficult they are to field.
It’s like the rather unpublicized PZR. UZR has to know how difficult every ball is to turn into an out. That’s how it judges fielders. Well, if you know how difficult every ball is to turn into an out, you credit/blame the pitcher for allowing that batted ball and then credit/blame the fielders from that point on (for making/not making a play).
Which of course still assumes our current definition of fielding = positioning + range + hands (essentially)
by Dan Turkenkopf on Jul 21, 2010 4:39 PM EDT up reply actions
Yep.
How much need is there to separate range and hands? Positioning is obvious, as that can be (ideally) 100% coached.
We're pretty good at separating range and hands already I think
Errors count against hands.
Obviously we’re tripped up when a player gets to a difficult ball but bobbles it, but my guess is that’s well within the noise of the stats.
Positioning versus range is the major sticking point.
by Dan Turkenkopf on Jul 21, 2010 4:44 PM EDT up reply actions
But tRAr is different than tRA*
The numbers for certain players have changed, and some results are outright wonky. I could give you examples if you want when work ends.
Also
The regression style has changed. It now uses past data for players.
Okay, maybe I'll e-mail him
I know that the 2009 tRAr value for Pineiro definitely is not the same as the tRA* value…which is no longer available, but was certainly in the 3’s.
Using past data is a really interesting topic (at least to me).
Say we observe a pitcher with a 7% HR/FB rate over 100 IP. We assume that’s not his true talent level and regress it towards 11%. How much? Not sure. Say 75%.
Now what if we also know that this pitcher posted a 7% HR/FB in 2009 and 2008 and 2007. Ignoring park effects, we’re a LOT more sure about his talent level now.
But should a metric of his 2010 performance take that into account? Say he posted a 13% HR/FB in 2007-2009. Or say he’s a rookie. Observed 2010 stats are all the same — can we judge their 2010 performances differently because of pre-2010 performance? I can see arguments in both directions.
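The arithmetic in that example, as a small Python sketch. The first function is just the regression step described above. The second is a common shrinkage form, k / (n + k), added only to illustrate why more history means less regression. The constant `k=300` is back-solved from "75% at 100 IP", and treating innings as the sample unit is an assumption, not an established value.

```python
def regress_to_mean(observed, mean, amount):
    """Move `observed` toward `mean` by the fraction `amount` (0 to 1)."""
    return observed + amount * (mean - observed)

def regression_amount(sample, k):
    """Fraction to regress under a k / (sample + k) shrinkage rule.

    k is a hypothetical constant: the sample size at which the observed
    rate and the population mean get equal weight.
    """
    return k / (sample + k)

# 7% HR/FB over 100 IP, regressed 75% of the way toward an 11% mean.
one_year = regress_to_mean(0.07, 0.11, 0.75)

# Same 7% rate, but over four seasons of about 100 IP each (assumed workload).
k = 300  # chosen so that 100 IP gives the 75% figure from the comment
four_year_amount = regression_amount(400, k)
four_years = regress_to_mean(0.07, 0.11, four_year_amount)
```

With one season the estimate lands at 10%. With four similar seasons the same rule regresses far less, which is the "a LOT more sure" intuition in numbers. Whether that belongs in a metric of 2010 performance is still the open question.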
I argue that you can and should regress components to a player's past mean to get a better sense of what he actually "did" that season
But I don’t really think there is a need for it. You might just as well use a projection system.
by vivaelpujols on Jul 21, 2010 6:09 PM EDT up reply actions
I'm kind of against using past data as a metric
It’s fine for projection systems, since there’s “two layers of guessing” there, so to speak, but not when you’re just trying to fit the metric. Players’ mechanics can change so much from year to year. Take Pineiro, in fact. He was a mostly league average pitcher until last year, when Dave Duncan had him add a groundball repertoire. Should we still hold his past against him?
Well, maybe you could adjust for significant changes.
Changes in GB rate, pitch selection, pitch movement, velocity, etc. Those things could even help define the baseline against which you regress.
I suppose one could do that, but...
I don’t think we have analysis developed nearly enough nowadays to know how much to adjust.
I think
that you can’t throw out past data just because some people develop in certain ways. It seems to me that for every Pineiro (and remember, nobody was sure if his GB rate would stay up after that one season) there are a bunch of guys who do not maintain their success.
It reminds me of the FanGraphs pieces at the beginning of the year about every player who was in the best shape of their career or added a new pitch. Sometimes it will make a big difference but that’s just something you have to mentally account for until the data backs it up, right?
It would be nice if we could take into account things like a very effective new pitch, though
I remember all the projection systems being totally unable to deal with JJ Putz after 2006, because they had no idea how good his new splitter was.
Say you had the pitch f/x numbers
Could one do a k-nearest neighbors analysis (or something) on a pitch using a database of pitch f/x data, find the run value of that pitch using the most similar pitches, guess the frequency, and then adjust the wOBA against/whatever accordingly?
Sorry…I hope that made sense….I’m half asleep.
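A rough Python sketch of the nearest-neighbors part of that idea, assuming each pitch is reduced to a few numeric features that have already been put on comparable scales, and that every pitch in the database carries a known run value. The feature choice, the numbers, and the function name are invented for illustration; the frequency guess and the wOBA adjustment are left out.

```python
import math

def knn_run_value(query, pitches, run_values, k=3):
    """Average run value of the k database pitches most similar to `query`.

    `pitches` is a list of feature tuples and `run_values` the matching
    per-pitch run values; similarity is plain Euclidean distance.
    """
    if len(pitches) != len(run_values):
        raise ValueError("pitches and run_values must be the same length")
    if not 1 <= k <= len(pitches):
        raise ValueError("k must be between 1 and the number of pitches")
    order = sorted(range(len(pitches)),
                   key=lambda i: math.dist(query, pitches[i]))
    nearest = order[:k]
    return sum(run_values[i] for i in nearest) / k

# Made-up, pre-scaled features: (velocity, horizontal break, vertical break).
database = [(0.9, -0.2, -1.1), (1.0, -0.1, -1.0), (-1.2, 0.8, 0.5), (-1.0, 1.0, 0.7)]
values = [-0.02, -0.04, 0.10, 0.06]
new_pitch = (0.95, -0.15, -1.05)
estimate = knn_run_value(new_pitch, database, values, k=2)
```

With so few neighbors the estimate is noisy, so it would need a large database or some regression on top before it says much about a brand-new pitch.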
What does...
What does “guess the frequency” mean?
I think you’d need to do some regressing as well to really understand it, or else get huge sample sizes.
Well, you're not just arbitrarily throwing out data.
If you want to say what a player is GOING to do for the rest of 2010, you certainly want to include 2009, 2008, etc. data. But for saying what he’s done so far in 2010? We don’t have to estimate how many runs a pitcher has given up from the start of the season until now – past seasons’ data cannot improve our accuracy at measuring what we know for certain has happened this year.
And here we get into the never-ending debate between...
Is it a projection of future performance, or an analysis of the underlying true talent in a current performance?
And how different are those two things?



























