Experimenting With Clustering - Offense

 

This post originated out of me asking myself, "Self, if you were going to delve into the world of projecting offense, how would you go about it?"  My answer was that I’d take a basic Marcels approach and add in some additional regression/weighting based on batted ball (plus a little extra) profiles.  That approach would require me to bin players based on batted ball profiles, so I immediately thought of k-means clustering using R.  The rest of this post is my brief exploration of batted ball profile clustering.

Using FanGraphs’ 2009 stats (filtered to just the qualifiers), I created clusters based on the following sets of statistics.

 

LD GB FB IFF
LD GB FB IFF HR
LD HR BB
HR BB K
GB FB ISO SPD
BB K

 

IFF = infield fly, HR = HR/FB%
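For anyone who wants to see what k-means is doing under the hood, here is a minimal sketch of the standard procedure (Lloyd's algorithm) in plain Python. It is not the R code used for this post: the function name `kmeans`, the seed, and the choice of k = 2 are illustrative assumptions, and the sample rows are the (LD%, HR/FB%, BB%) lines of six qualifiers who come up later in the post. The post does not say whether the inputs were rescaled before clustering, so this sketch uses the raw percentages.

```python
import math
import random

def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm: return (centers, labels) for a list of equal-length tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k distinct rows as starting centers
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centers.append(centers[c])  # leave an empty cluster's center alone
        if new_labels == labels and new_centers == centers:
            break  # nothing moved, so the clustering has converged
        centers, labels = new_centers, new_labels
    return centers, labels

# (LD%, HR/FB%, BB%) for six qualifiers named later in the post.
hitters = [(15, 15, 9), (20, 18, 15), (19, 12, 8),
           (21, 5, 12), (17, 15, 8), (17, 16, 14)]
centers, labels = kmeans(hitters, k=2)
print(centers)
print(labels)
```

Because the starting centers are picked at random, a different seed can give a different grouping, which is one reason the choice of starting centers matters for k-means.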

The full lists of clusters can be found here, and I’ll discuss some of the things I found interesting after the jump.


Not surprisingly, the sets of stats that included some version of walk rate did better (anecdotally, at least) at separating the good hitters (based on wOBA) from the bad ones. If one wishes to look only at the physical batted ball profiles, though, adding in HR/FB weeds out some of the noise. I found it mildly amusing that if you look only at batted ball types (excluding HR/FB), Yuniesky Betancourt and Albert Pujols fall in the same cluster. The set of clusters I decided to focus on was the one based on LD, HR/FB, and BB. Here are its cluster centers, along with the average wOBA of each cluster.

Cluster  LD   HR/FB  BB   wOBA
1        17%  17%    10%  0.369
2        23%  6%     14%  0.353
3        19%  10%    5%   0.330
4        16%  9%     8%   0.326
5        20%  13%    9%   0.362
6        18%  12%    13%  0.363
7        19%  23%    15%  0.398
8        19%  4%     7%   0.315
10       19%  18%    14%  0.392

 

And here are a few guys who stand out for having a low wOBA relative to their cluster (potential for improvement, maybe?):

Name             LD%  HR/FB  BB%  Cluster  Player wOBA  Cluster avg wOBA
Brandon Inge     15%  15%    9%   1        0.315        0.369
Jack Cust        20%  18%    15%  10       0.342        0.392
Alfonso Soriano  19%  12%    8%   5        0.314        0.362
Russell Martin   21%  5%     12%  2        0.307        0.353
Mark DeRosa      17%  15%    8%   1        0.327        0.369
Dan Uggla        17%  16%    14%  10       0.354        0.392
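To make "low wOBA relative to their cluster" concrete, the gap is just the player's wOBA minus his cluster's average wOBA, which are the last two columns of the table above. A small sketch of that arithmetic, with the numbers copied from the table (the function name `woba_gaps` is only for illustration):

```python
# (name, player wOBA, cluster average wOBA), copied from the table above.
rows = [
    ("Brandon Inge", 0.315, 0.369),
    ("Jack Cust", 0.342, 0.392),
    ("Alfonso Soriano", 0.314, 0.362),
    ("Russell Martin", 0.307, 0.353),
    ("Mark DeRosa", 0.327, 0.369),
    ("Dan Uggla", 0.354, 0.392),
]

def woba_gaps(rows):
    """Return (name, gap) pairs sorted from the largest shortfall to the smallest."""
    gaps = [(name, round(player - cluster, 3)) for name, player, cluster in rows]
    return sorted(gaps, key=lambda pair: pair[1])

for name, gap in woba_gaps(rows):
    print(f"{name:16s} {gap:+.3f}")
```

Inge comes out with the largest shortfall (0.054 below his cluster's average) and Uggla with the smallest (0.038 below).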


Finally, I looked at the players with fewer PAs (100-300) to see which cluster they fell in (note: I didn’t re-cluster; I just checked which center each player was closest to). The following players fell in the high-wOBA clusters (1, 5, 6). Note: no players fell in clusters 7 or 10.

 

Name            Team          PA   LD%  HR/FB  BB%  wOBA   Cluster
Randy Ruiz      Blue Jays     130  11%  31%    8%   0.428  1
Kyle Blanks     Padres        172  13%  21%    11%  0.372  1
Rickie Weeks    Brewers       162  19%  19%    8%   0.365  6
Justin Maxwell  Nationals     102  14%  19%    12%  0.357  1
Rocco Baldelli  Red Sox       164  18%  17%    7%   0.326  6
Ryan Raburn     Tigers        291  15%  17%    9%   0.378  6
Drew Stubbs     Reds          196  21%  17%    8%   0.335  5
David Ross      Braves        151  22%  16%    14%  0.386  5
Matt Stairs     Phillies      129  11%  16%    18%  0.327  1
Brandon Allen   Diamondbacks  116  17%  16%    10%  0.288  6
Landon Powell   Athletics     155  18%  15%    9%   0.315  6
Ramon Castro    - - -         171  22%  14%    9%   0.304  5
Marcus Thames   Tigers        294  18%  14%    10%  0.329  6
Travis Snider   Blue Jays     276  15%  14%    11%  0.327  6
Mat Gamel       Brewers       148  27%  13%    12%  0.332  5
Jayson Nix      White Sox     290  13%  13%    10%  0.319  6
Carlos Delgado  Mets          112  20%  13%    11%  0.394  5
Eric Hinske     - - -         224  18%  13%    12%  0.344  6
Andres Torres   Giants        170  17%  13%    10%  0.379  6
Alex Gordon     Royals        189  14%  12%    11%  0.321  6
Gabe Kapler     Rays          238  23%  11%    12%  0.334  5
Chris Snyder    Diamondbacks  202  17%  11%    16%  0.304  6
Jesus Flores    Nationals     106  18%  11%    11%  0.375  6
Chris Gimenez   Indians       130  19%  10%    13%  0.233  6
Austin Kearns   Nationals     211  19%  7%     16%  0.298  6
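The "closest center" step described above (assigning a player to an existing cluster without re-clustering) can be sketched as follows. This is only an illustration under two assumptions the post does not confirm: that distance is plain Euclidean distance on the raw percentages, and that the rounded centers from the cluster table are good enough to work with. Because of that, its output need not match the cluster column in the table above. The names `CENTERS` and `nearest_cluster` are hypothetical.

```python
import math

# Cluster centers (LD%, HR/FB%, BB%) as rounded in the cluster table earlier
# in the post. Cluster 9 is not listed there, so it is absent here too.
CENTERS = {
    1: (17, 17, 10),
    2: (23, 6, 14),
    3: (19, 10, 5),
    4: (16, 9, 8),
    5: (20, 13, 9),
    6: (18, 12, 13),
    7: (19, 23, 15),
    8: (19, 4, 7),
    10: (19, 18, 14),
}

def nearest_cluster(profile, centers=CENTERS):
    """Return the id of the center closest to profile (Euclidean distance)."""
    return min(centers, key=lambda cid: math.dist(profile, centers[cid]))

print(nearest_cluster((15, 15, 9)))   # Brandon Inge's line -> 1
print(nearest_cluster((20, 18, 15)))  # Jack Cust's line -> 10
```

As a sanity check, Inge and Cust land in clusters 1 and 10, the same clusters listed for them among the qualifiers.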

 

Since I'm probably not going to venture into the world of projections (there are already plenty of people who do a much better job than I could), this all boils down to an interesting thought experiment. That being said, I thought someone out there might have a use for the data.

Comments

Ok, I've done some basic multivariate statistics

And have done some basic cluster analysis before (UPGMA & WPGMA), as well as principal components and discriminant function analysis…

But could you explain a little more what this approach is doing? What I think is going on is that you’re taking certain sets of input data, and it’s pulling out groups of players that seem to have commonalities based on those input data. I don’t quite get why the input data is redundant. You use LD several times, I think… are these separate runs, or one run that gets each of those “sets” of input data? Anyway, just a basic synopsis of what you’re doing would be really helpful.
-j

by JinAZ on Jan 29, 2010 12:09 PM EST

Sorry, I should have been more clear

Each line in the sets of statistics portion is a separate run/analysis (i.e., I clustered based on the first line (LD, FB, GB, IFF) and got some results, then I clustered based on the second line (LD, FB, GB, IFF, HR) and got some different results). Does that answer your basic question?

The methodology was basically me taking various groups of stats (that I picked for no great reason other than curiosity) and seeing how they clustered players.

by stevesommer05 on Jan 29, 2010 12:20 PM EST

Oooh, pretty!

I had also thought about doing some projection work based on clusters. The biggest problem is how to get a set of input variables that covers a player’s skillset whilst avoiding co-linearity. It’s a tall order, but if done right, it’s the sort of thing that would be an amazing step forward.

Methodological question: Were you just asking for 10 clusters a priori, or did you have some sort of criterion for knowing when to tell the proximity matrix to stop combining?

by pizzacutter on Jan 29, 2010 12:23 PM EST
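One rough way to screen candidate inputs for the co-linearity raised in the comment above is to look at pairwise correlations before clustering. The sketch below is a plain-Python Pearson correlation; the function name `pearson` is illustrative, and the six HR/FB and BB values are the qualifiers from the post's second table, which is far too small a sample to conclude anything from.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# HR/FB% and BB% for the six qualifiers in the post's second table.
hr_fb = [15, 18, 12, 5, 15, 16]
bb = [9, 15, 8, 12, 8, 14]
print(f"HR/FB vs BB correlation: {pearson(hr_fb, bb):+.2f}")
```

A pair of inputs with a correlation near +1 or -1 is carrying largely the same information, so one of the two could probably be dropped.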

A priori unfortunately

Were I to do it for real I’d have to look into the research done on number of clusters and whatnot… of which I’ve only skimmed.

Agreed that the input variables are key (as Tango has pointed out today).

by stevesommer05 on Jan 29, 2010 12:30 PM EST

Model Based?

This is really interesting stuff, and something I’ve been wanting to mess around with for a while. I think it’s underutilized in baseball analysis.

What about model-based clustering? Since you’re into using R, there is a model-based clustering function, Mclust (in the “mclust” package), that works really well. It includes choosing the shape, size, etc. of the clusters with a BIC chart/graph. With model-based clustering, I think there’s a little less worry about your medoids/centroids getting ‘stuck’ when the ‘initial centroid’ is in the wrong place. It also deals with strange cluster shapes a little better.

by BMMillsy on Jan 29, 2010 4:22 PM EST

Unfortunately

Unfortunately, there’s not much there to determine the number of clusters (well, actually there is a lot, but no one agrees on any of it, and researching it could take you a while). One of your best choices may be to simply cluster for each k = 2:n and see which you like best. This is fairly simple for k-means, but you have to run it many times, so it may be a better idea to look into hierarchical clustering using the R function hclust(). Also, if you don’t input starting cluster centers in R, you’ll find its k-means algorithm is very inconsistent (it finds all sorts of local minima on repeated trials). If you implement k-means++ (very simple to implement), the results are much more consistent and, in my opinion, better.

Also, if you want to actually delve into clustering, I might recommend weighted kernel k-means/spectral clustering (they are essentially the same thing, just two different ways to formulate it; both end up as a trace maximization problem).

by kingofthehobos on Jan 29, 2010 4:12 PM EST
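For reference, the k-means++ seeding mentioned in the comment above can be sketched in a few lines of plain Python. The idea is to pick the first center at random, then pick each later center with probability proportional to its squared distance from the nearest center already chosen, so the starting centers tend to be spread out. The function names, the seed, and the sample rows (six qualifiers' LD%, HR/FB%, BB% from the post) are illustrative assumptions, not anyone's actual code.

```python
import math
import random

def kmeans_pp_seeds(points, k, seed=0):
    """Pick k starting centers with k-means++ (squared-distance-weighted sampling)."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from each point to its closest already-chosen center.
        d2 = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        if sum(d2) == 0:
            # Every point coincides with a chosen center; fall back to uniform.
            centers.append(rng.choice(points))
        else:
            centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

def within_ss(points, centers):
    """Total squared distance from each point to its nearest center."""
    return sum(min(math.dist(p, c) ** 2 for c in centers) for p in points)

hitters = [(15, 15, 9), (20, 18, 15), (19, 12, 8),
           (21, 5, 12), (17, 15, 8), (17, 16, 14)]
seeds = kmeans_pp_seeds(hitters, k=3)
print(seeds)
print(within_ss(hitters, seeds))
```

The seeds are only a starting point: they would still be handed to a k-means run (R's `kmeans()` accepts a matrix of initial centers through its `centers` argument), and the within-cluster sum of squares after that run is what one would compare across different values of k.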

Thanks for the info

Yeah, at this point it was just an interesting science experiment, and a way for me to dig into R a little more.

by stevesommer05 on Jan 29, 2010 4:20 PM EST

Curiosity pushes me to ask...

You used a good subset of groupings, but why not the following set?

LD HR BB K

I would think that, following LD HR BB, you would find that wOBA is impacted by the number of PAs in which a batter makes contact in addition to what percentage came out as a LD, which would explain why Cust was on the low end of that one cluster, and might provide a better fit while avoiding co-linearity.

by Trickman on Jan 29, 2010 7:52 PM EST

Clarifying

No good reason why I didn’t do that one. In fact I vaguely remember doing it (might have done it in a different spreadsheet on another computer)…

by stevesommer05 on Jan 29, 2010 9:28 PM EST

I would guess that the more variables you include, the more clusters you'd expect to get out? Is that true? (I know nothing about this.)

How about this set of data:

BB/PA
SO/AB
HR/CON
BABIP
ISOBIP

If I could break it down even more, I’d try something like this:

BB/PA
SO/AB
OFFB, IFFB, LD, and GB per CON (balls in play plus HRs)
HR/(OFFB+LD)
H/(OFFB-HR), (H+RBOE)/GB, H/(LD-HR)
xbases/(OFFB-HR), xbases/GB, xbases/(LD-HR)

Certainly room for rearrangement.

by Sky Kalkman on Jan 30, 2010 8:52 AM EST

Back in the day...

I tried something like this, and I used 10 variables:

plate discipline measure (mine), swing%, contact%, LD, GB, and FB rates, speed score (mine), power score (mine), 2 strike foul rate, and non-2 strike foul rate.

I was most interested in the proximity matrix and how that might improve similarity scores.

by pizzacutter on Jan 30, 2010 3:48 PM EST
