Experimenting With Clustering - Offense
This post originated out of me asking myself, "Self, if you were going to delve into the world of projecting offense, how would you go about it?" My answer was that I’d take a basic Marcels approach and add in some additional regression/weighting based on batted ball (plus a little extra) profiles. That approach would require me to bin players based on batted ball profiles, so I immediately thought of k-means clustering using R. The rest of this post is my brief exploration of batted ball profile clustering.
Using FanGraphs’ 2009 stats (filtered to just the qualifiers), I created clusters based on the following sets of statistics:
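| Run | Stat 1 | Stat 2 | Stat 3 | Stat 4 | Stat 5 |
|---|---|---|---|---|---|
| 1 | LD | GB | FB | IFF | |
| 2 | LD | GB | FB | IFF | HR |
| 3 | LD | HR | BB | | |
| 4 | HR | BB | K | | |
| 5 | GB | FB | ISO | SPD | |
| 6 | BB | K | | | |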
IFF = infield fly, HR = HR/FB%
The full lists of clusters can be found here, and I’ll discuss some of the things I found interesting after the jump.
Not surprisingly, the sets of stats that included some version of walk rate did better (anecdotally at least) at separating the good hitters (based on wOBA) from the bad ones, but if one wishes to look only at the physical batted ball profiles, then adding in HR/FB weeds out some of the noise. I found it mildly amusing that if you look only at batted ball types (excluding HR/FB), Yuniesky Betancourt and Albert Pujols fall in the same cluster. The set of clusters I decided to focus on was the one based on LD, HR/FB, and BB. Here are its cluster centers, along with the average wOBA of each cluster.
| Cluster | LD | HR/FB | BB | wOBA |
|---|---|---|---|---|
| 1 | 17% | 17% | 10% | 0.369 |
| 2 | 23% | 6% | 14% | 0.353 |
| 3 | 19% | 10% | 5% | 0.330 |
| 4 | 16% | 9% | 8% | 0.326 |
| 5 | 20% | 13% | 9% | 0.362 |
| 6 | 18% | 12% | 13% | 0.363 |
| 7 | 19% | 23% | 15% | 0.398 |
| 8 | 19% | 4% | 7% | 0.315 |
| 10 | 19% | 18% | 14% | 0.392 |
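For readers who want to try something similar, here is a minimal sketch of k-means (Lloyd's algorithm) in plain Python rather than R. The sample rows, the helper names (`zscore_columns`, `kmeans`), and the choice to z-score each stat first are illustrative assumptions: the post does not say whether the inputs were scaled, and R's `kmeans()` defaults to a different algorithm (Hartigan-Wong), so this is not a reproduction of the run behind the table.

```python
import math
import random

def zscore_columns(rows):
    """Standardize each column to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # Fall back to 1.0 when a column has zero spread, to avoid dividing by 0.
    sds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
           for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm with random starting centers; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster empties out
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centers

# Hypothetical (LD%, HR/FB%, BB%) rows, NOT the 2009 FanGraphs data.
players = [
    [17, 17, 10], [18, 18, 11], [16, 19, 9],
    [19, 4, 7], [20, 5, 6], [18, 3, 8],
    [23, 6, 14], [22, 7, 15], [24, 5, 13],
]
scaled = zscore_columns(players)
labels, centers = kmeans(scaled, k=3)
print(labels)
```

Because the starting centers are random, different seeds can give different groupings, which is one reason the number of clusters and the starting points deserve more care in a real analysis.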
And here are a few guys who stand out by having a low wOBA relative to their cluster (potential for improvement, maybe?):
| Name | LD% | HR/FB | BB% | Cluster | wOBA | Cluster wOBA |
|---|---|---|---|---|---|---|
| Brandon Inge | 15% | 15% | 9% | 1 | 0.315 | 0.369 |
| Jack Cust | 20% | 18% | 15% | 10 | 0.342 | 0.392 |
| Alfonso Soriano | 19% | 12% | 8% | 5 | 0.314 | 0.362 |
| Russell Martin | 21% | 5% | 12% | 2 | 0.307 | 0.353 |
| Mark DeRosa | 17% | 15% | 8% | 1 | 0.327 | 0.369 |
| Dan Uggla | 17% | 16% | 14% | 10 | 0.354 | 0.392 |
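The "low relative to cluster" idea is just the difference between the cluster's average wOBA and the player's own. A small Python sketch using the six rows from the table above (the variable names are illustrative):

```python
# (name, player wOBA, cluster-average wOBA), copied from the table above
rows = [
    ("Brandon Inge", 0.315, 0.369),
    ("Jack Cust", 0.342, 0.392),
    ("Alfonso Soriano", 0.314, 0.362),
    ("Russell Martin", 0.307, 0.353),
    ("Mark DeRosa", 0.327, 0.369),
    ("Dan Uggla", 0.354, 0.392),
]

# Gap = cluster average minus player wOBA; a bigger gap means further below the cluster.
gaps = sorted(
    ((name, round(cluster_avg - woba, 3)) for name, woba, cluster_avg in rows),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, gap in gaps:
    print(f"{name:16s} {gap:.3f}")
```

The gaps run from 0.054 (Inge) down to 0.038 (Uggla).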
Finally, I took a look at players with fewer PAs (100-300) to see which cluster they fell in (note: I didn’t re-cluster, I just checked which center each of these players was closest to). The following players fell in the high-wOBA clusters (1, 5, 6). Note: no players fell in clusters 7 or 10.
| Name | Team | PA | LD% | HR/FB | BB% | wOBA | cluster |
|---|---|---|---|---|---|---|---|
| Randy Ruiz | Blue Jays | 130 | 11% | 31% | 8% | 0.428 | 1 |
| Kyle Blanks | Padres | 172 | 13% | 21% | 11% | 0.372 | 1 |
| Rickie Weeks | Brewers | 162 | 19% | 19% | 8% | 0.365 | 6 |
| Justin Maxwell | Nationals | 102 | 14% | 19% | 12% | 0.357 | 1 |
| Rocco Baldelli | Red Sox | 164 | 18% | 17% | 7% | 0.326 | 6 |
| Ryan Raburn | Tigers | 291 | 15% | 17% | 9% | 0.378 | 6 |
| Drew Stubbs | Reds | 196 | 21% | 17% | 8% | 0.335 | 5 |
| David Ross | Braves | 151 | 22% | 16% | 14% | 0.386 | 5 |
| Matt Stairs | Phillies | 129 | 11% | 16% | 18% | 0.327 | 1 |
| Brandon Allen | Diamondbacks | 116 | 17% | 16% | 10% | 0.288 | 6 |
| Landon Powell | Athletics | 155 | 18% | 15% | 9% | 0.315 | 6 |
| Ramon Castro | - - - | 171 | 22% | 14% | 9% | 0.304 | 5 |
| Marcus Thames | Tigers | 294 | 18% | 14% | 10% | 0.329 | 6 |
| Travis Snider | Blue Jays | 276 | 15% | 14% | 11% | 0.327 | 6 |
| Mat Gamel | Brewers | 148 | 27% | 13% | 12% | 0.332 | 5 |
| Jayson Nix | White Sox | 290 | 13% | 13% | 10% | 0.319 | 6 |
| Carlos Delgado | Mets | 112 | 20% | 13% | 11% | 0.394 | 5 |
| Eric Hinske | - - - | 224 | 18% | 13% | 12% | 0.344 | 6 |
| Andres Torres | Giants | 170 | 17% | 13% | 10% | 0.379 | 6 |
| Alex Gordon | Royals | 189 | 14% | 12% | 11% | 0.321 | 6 |
| Gabe Kapler | Rays | 238 | 23% | 11% | 12% | 0.334 | 5 |
| Chris Snyder | Diamondbacks | 202 | 17% | 11% | 16% | 0.304 | 6 |
| Jesus Flores | Nationals | 106 | 18% | 11% | 11% | 0.375 | 6 |
| Chris Gimenez | Indians | 130 | 19% | 10% | 13% | 0.233 | 6 |
| Austin Kearns | Nationals | 211 | 19% | 7% | 16% | 0.298 | 6 |
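Assigning a new player to the closest existing center, without re-clustering, can be sketched in a few lines of Python. This assumes raw percentage points and plain Euclidean distance, and it uses the rounded centers printed earlier in the post. The post does not say whether the stats were scaled before clustering, so this sketch will not necessarily reproduce every assignment in the table above. `nearest_cluster` is a hypothetical helper name.

```python
# Cluster centers (LD%, HR/FB%, BB%) as printed in the post's table.
centers = {
    1: (17, 17, 10), 2: (23, 6, 14), 3: (19, 10, 5),
    4: (16, 9, 8), 5: (20, 13, 9), 6: (18, 12, 13),
    7: (19, 23, 15), 8: (19, 4, 7), 10: (19, 18, 14),
}

def nearest_cluster(player, centers):
    """Return the id of the center with the smallest squared Euclidean distance."""
    return min(
        centers,
        key=lambda cid: sum((x - y) ** 2 for x, y in zip(player, centers[cid])),
    )

# Kyle Blanks's line from the table: LD 13%, HR/FB 21%, BB 11%
print(nearest_cluster((13, 21, 11), centers))  # prints 1
```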
Seeing as how I'm probably not going to venture into the world of projections (there are already plenty of people who do a much better job than I could), this all boils down to an interesting thought experiment. That being said, I thought someone out there might have a use for the data.
Comments
Ok, I've done some basic multivariate statistics
And have done some basic cluster analysis before (UPGMA & WPGMA), as well as principal components and discriminant function analysis…
But could you explain a little more what this approach is doing? What I think is going on is that you’re taking certain sets of input data, and it’s pulling out groups of players that seem to have commonalities based on those input data. I don’t quite get why the input data is redundant. You use LD several times, I think… (are these separate runs, or one run that gets each of those “sets” of input data?). Anyway, just a basic synopsis of what you’re doing would be really helpful.
-j
I write at:
Beyond the Boxscore | Red Reporter | Basement-Dwellers.com | Twitter: @jinazreds
Sorry, I should have been more clear
Each line in the sets of statistics portion is a separate run/analysis: I clustered based on the first line (LD, FB, GB, IFF) and got some results, then I clustered based on the second line (LD, FB, GB, IFF, HR) and got some different results. Does that answer your basic question?
The methodology was basically me taking various groups of stats (that I picked for no great reason other than curiosity) and seeing how they clustered players.
by stevesommer05 on Jan 29, 2010 12:20 PM EST
Oooh, pretty!
I had also thought about doing some projection work based on clusters. The biggest problem is how to get a set of input variables that covers a player’s skillset whilst avoiding co-linearity. It’s a tall order, but if done right, it’s the sort of thing that would be an amazing step forward.
Methodological question: Were you just asking for 10 clusters a priori, or did you have some sort of criteria for knowing when to tell the proximity matrix to stop combining?
A priori unfortunately
Were I to do it for real I’d have to look into the research done on number of clusters and whatnot… of which I’ve only skimmed.
Agreed that the input variables are key (as Tango has pointed out today).
by stevesommer05 on Jan 29, 2010 12:30 PM EST
Model Based?
This is really interesting stuff, and something I’ve been wanting to mess around with for a while. I think it’s underutilized in baseball analysis.
What about model-based clustering? Since you’re into using R, there is a model-based clustering package, “mclust” (its main function is Mclust()), that works really well. It includes choosing the shape, size, etc. of the clusters with a BIC chart/graph. With model-based clustering, I think there’s a little less worry about your medoids/centroids getting ‘stuck’ when the ‘initial centroid’ is in the wrong place. It also deals with strangely shaped clusters a little better.
Unfortunately
Unfortunately, there’s not much there to determine the number of clusters (well, actually there is a lot, just no one agrees on any of it, and researching it could take you a while). One of your best choices may be to simply cluster for k = 2:n and see which you like best. This is fairly simple for k-means, but you have to run it many times, so it may be a better idea to look into hierarchical clustering using the R function hclust(). Also, if you don’t input starting clusters in R, you’ll find its k-means algorithm is very inconsistent (it finds all sorts of local minima on repeated trials). If you implement k-means++ (very simple to implement), the results are much more consistent and, in my opinion, better.
Also, if you want to actually delve into clustering, I might recommend weighted kernel k-means/spectral clustering (they are essentially the same thing, just 2 different ways to formulate it. Both end up as a trace maximization problem.)
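The k-means++ seeding and the scan over k mentioned above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the R implementation: the data points are hypothetical, `kmeanspp_init` and `lloyd_wcss` are made-up helper names, and the within-cluster sum of squares (WCSS) is just one of many possible ways to compare values of k. In R itself, the `nstart` argument of `kmeans()` is another way to reduce the run-to-run inconsistency.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeanspp_init(points, k, seed=0):
    """k-means++ seeding; assumes there are at least k distinct points."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]  # first center: uniform at random
    while len(centers) < k:
        # Squared distance from each point to its nearest chosen center.
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        # Next center: sampled with probability proportional to d2.
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

def lloyd_wcss(points, centers, iters=100):
    """Run Lloyd's iterations from the given centers; return the final WCSS."""
    centers = [list(c) for c in centers]
    labels = None
    for _ in range(iters):
        new = [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
               for p in points]
        if new == labels:
            break
        labels = new
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))

# Hypothetical (LD%, HR/FB%, BB%) rows, not real player data.
points = [
    [17, 17, 10], [18, 18, 11], [16, 19, 9],
    [19, 4, 7], [20, 5, 6], [18, 3, 8],
    [23, 6, 14], [22, 7, 15], [24, 5, 13],
]
seeds = kmeanspp_init(points, 3)

# Scan k and keep the best of 10 seeded restarts for each value.
scan = {}
for k in range(2, 6):
    scan[k] = min(lloyd_wcss(points, kmeanspp_init(points, k, seed=s))
                  for s in range(10))
    print(k, round(scan[k], 1))
```

WCSS generally shrinks as k grows, so the usual move is to look for the k where the improvement levels off rather than to pick the minimum.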
Thanks for the info
Yeah at this point it was just an interesting science experiment, and a way for me to dig into R a little more.
by stevesommer05 on Jan 29, 2010 4:20 PM EST
Curiosity pushes me to ask...
You used a good subset of groupings, but why not the following set?
LD HR BB K
I would think that, following LD HR BB, you would find that wOBA is impacted by the number of PAs in which a batter makes contact in addition to what percentage came out as a LD — which would explain why Cust was on the low end of that one cluster, and might provide a better fit while avoiding co-linearity.
Clarifying
No good reason why I didn’t do that one. In fact I vaguely remember doing it (might have done it in a different spreadsheet on another computer)…
by stevesommer05 on Jan 29, 2010 9:28 PM EST
I would guess that the more variables you include, the more clusters you'd expect to get out? Is that true? (I know nothing about this.)
How about this set of data:
BB/PA
SO/AB
HR/CON
BABIP
ISOBIP
If I could break it down even more, I’d try something like this:
BB/PA
SO/AB
OFFB, IFFB, LD, and GB per CON (balls in play plus HRs)
HR/(OFFB+LD)
H/(OFFB-HR), (H+RBOE)/GB, H/(LD-HR)
xbases/(OFFB-HR), xbases/GB, xbases/(LD-HR)
Certainly room for rearrangement.
Back in the day...
I tried something like this, and I used 10 variables:
plate discipline measure (mine), swing%, contact%, LD, GB, and FB rates, speed score (mine), power score (mine), 2 strike foul rate, and non-2 strike foul rate.
I was most interested in the proximity matrix and how that might improve similarity scores.
