Navigation: Jump to content areas:


Pro Quality. Fan Perspective.
Login-facebook
Around SBN: Raiders' GM Begins The Purge

Advanced Graphing Techniques Part 4 - Cluster Analysis in R

If you have ever worked with the raw pitch f/x data, you would have likely noticed that MLBAM's real time pitch classification isn't always accurate. One can classify pitches on their own using Excel (or any other spreadsheet program) for individual games, but it becomes rather difficult and time consuming when the data is immense, say a starting pitcher's data from the past two seasons.

Most pitchers have pretty defined clusters for each of their pitches on a plot of three pitch f/x variables: pfx_x (horizontal spin movement), pfx_z (vertical spin movement), and starting velocity. This is where R becomes quite useful. It has the power to use cluster analysis for multiple variables. I'm still a novice in R so I will work with k-means clustering which is pretty simple to work with in R.

 

Here is the step by step process.

1. Follow Part 1 if you don't have a database yet.

2. In Part 2, follow steps 1-7 if you don't have R installed and connected to your database. When R is installed, follow step 8 to connect it to your database in RGui (the interface of R).

3. To use cluster analysis, you need to install the package cluster in RGui. On the top bar, click packages and scroll down to "Install Package(s)..." and click it. Select the appropriate CRAN mirror and when the list of packages comes up select "cluster" to install it. Then either go to packages on the top bar and select load packages and click on cluster or run the command library(cluster).

4. Now for the query. We need to gather data for the three variables mentioned earlier for an individual pitcher. I usually grab it in individual years to remove as much noise as possible or if the pitcher changed his movement/velocity on some of his pitches (rare). And you want to take out pitches that were pitchouts or intentional balls as they are not real pitches.

query <- sqlQuery(channel, "select p.pfx_x, p.pfx_z, p.start_speed from atbats a join pitches p on p.ab_id = a.ab_id join games g on g.game_id = a.game_id join players pl on pl.eliasid = a.pitcher where pl.last='verlander' and year(g.date)=2009 and p.pfx_x is not null and (pitch_type<>'in' and pitch_type<>'po') order by p.pitch_id asc;")

 

Keep the ORDER BY part as it makes sure your query is in a particular order (this becomes useful later).

5. Once you have the query you want, time to run the cluster analysis. These are commands to run the analysis and then plot them into R.

cl<-kmeans(query,4)

The number corresponds to number of clusters you want. In this case, Justin Verlander only has four pitches: fastball, changeup, slider, and curve. Now to plot them use this command.

plot(query,col=cl$cluster)

You should now have a plot that appears like this.

Plotcluster_medium

via img72.imageshack.us

 

 

As you can see, all three variables are plotted and there are six nice graphs plotting each against each other. The one on the far left is the x and y plot, the common pfx_x and pfx_z graph (spin movement). Under that plot is the pfx_x by velocity plot which visually brings out any pitch clusters quite clearly. For the sake of learning how to run cluster analysis, I will leave the problems associated with this step at the end of the post and continue on with the process.

6. Now that you have the clusters you want, it is time to put them in a spreadsheet in order to reclassify any missed pitches but more importantly, place them into a MySQL database. The easiest method I found is to create a csv file, then upload that file into Excel.

write.csv(cl$cluster,file="c:/baseball/verlander09.txt")

You can create any file path you want as R will make a text file directly from its software. But this text file only contains a number that corresponds to the cluster R assigned for each pitch. You then need to get the pitch id from your database and some other variables to know which cluster is which in Excel. Using the same query as before, just add the pitch_id in the select portion of the query.

query1 <- sqlQuery(channel, "select p.pitch_id as id, p.pfx_x, p.pfx_z, p.start_speed from atbats a join pitches p on p.ab_id = a.ab_id join games g on g.game_id = a.game_id join players pl on pl.eliasid = a.pitcher where pl.last='verlander' and year(g.date)=2009 and p.pfx_x is not null and (pitch_type<>'in' and pitch_type<>'po') order by p.pitch_id asc;")

Now create another text file that contains this query.

write.csv(query1,file="c:/baseball/verlander09q.txt")

7. To load up the text files into Excel, click on the data tab and click on From Text in the get external data section. Load up both files individually into the spreadsheet. Click next on the first window but when the second window comes up, click on Comma in the delimiters section. This makes sure that Excel separates data into columns. Once you have organized the data known which pitch each cluster number is, you can put those pitch ids into the database.

8. If you are using the database from Part 1 or Mike Fast's method, there should be a column called my_pitch_type in the pitches table. This is where you will place your pitch classifications. Most MySQL guis should have a function that allows you to change the structure of your tables. In SQLyog, open the tables in your database and right click on any column type and click manage columns.

Mysqlchange_medium

via img705.imageshack.us

 

The default data type for my_pitch_type is tinyint which is for numbers. Changing it to varchar will allow MySQL to input letters into the column. Once you have changed it, you can now input your pitch classifications into the database using the UPDATE function. This is how the function should look like.

UPDATE pitches SET My_pitch_type='FF'

WHERE pitch_id=

OR pitch_id=

Of course you will have multiple pitch ids. To do this, just go into a new sheet in Excel and fill an entire column with OR pitch_id= using the drag and drop method. In the next column, copy and paste the pitch ids from the previous sheet for the pitch type you want.

Then do this for all the pitch types and copy and paste the columns into your MySQL gui and it is now set to run it. Just remember to change the pitch type in the SET function for each pitch type you insert and keep it simple by using the same labels pitch f/x uses in its own your database!

 

Potential Problems

Going back to step five, which is the most crucial part of the process, this is where you just have to play around in R. Sometimes the clustering doesn't look like you want it to. You can just run the cluster function and plot it again to see if it returns some more desirable results. If a pitch was used with little frequency, it is likely k-means won't register it as an individual cluster. You might have to change the number of clusters k-means finds in order to get the most accurate results.

This part depends highly on what the pitcher throws in the eye of pitch f/x. You might have to cut the sample size to half seasons, or run a individual k-means on pitches that are tightly clustered, for example fastballs and changeups or two and four seam fastballs. Also remember that this cluster analysis doesn't take into account individual ballparks or changes in arm angle. So you may have to cut the sample size by ballpark or date although this rarely should happen/

This isn't a perfect analysis and there may be a better cluster function in R but it does help out a whole lot when trying to id pitches in a large sample size and it produces good results most of the time.

 

If you have any questions and comments about this post or any of the other parts of this series, please feel free to address it in the comments section.

Comment 11 comments  |  9 recs  | 

Do you like this story?

Comments

Display:

Awesome RZ

So I just clustered up Tommy Hanson and compared him to my hand done game by game classifications. The cluster method got it 98.7% right… goddamnit that’s good. Gameday, on the other hand, got it 93% right. So at least for Hanson, the K-Means is more accurate.

Hanson is also pretty easy to classify, so I’ll try to dig up some of the other pitcher’s I’ve classified like Jarrod Washburn who has like 6 pitches, which for some reason I haven’t updated into my DB yet.

Anyway, thanks a lot RZ. I wonder if there isn’t an easier way to output the pitch_id and pitch_type with just one query though.

by vivaelpujols on Mar 9, 2010 11:58 PM EST reply actions  

Okay

So I dug up my Washburn file, which I spent about 2 hours classifying just 487 pitches – he was hard.

Gameday was 68% accurate in that sample according to my classifications, and the K-Means cluster one I just created in like 20 seconds was 90% accurate – and that’s assuming that mine are perfectly accurate as well, which probably isn’t correct either. So, wow, that’s really good and Washburn is about as hard of a pitcher to classify as there is out there.

So I think I’m comfortable using K-Means as a quick and dirty classification algorithm to get a pretty high accuracy (at the moment it looks to be much better than gameday), without having to go through the rigor of doing it game by game. With park adjusted data, it should be almost perfect.

by vivaelpujols on Mar 10, 2010 12:49 AM EST up reply actions  

Basically, I think the most important part is first determing which pitch types each pitcher throws

This could be done by eyeballing Pitch f/x data, or looking at scouting reports or what have you. Then after you’ve identified the name of each pitch and how many there are, you do the K-Means formula in R, download to CSV and sort by cluster number. Mix and match with the original data to see which cluster number refers to which pitch type, fill them in and viola – 95% accurate pitch classifications in 20 minutes.

by vivaelpujols on Mar 10, 2010 12:58 AM EST up reply actions  

I usually check out the pitch f/x charts on texas leaguers first

if I don’t how a pitcher’s pitches look like in pitch f/x. I remember trying to do Derek Lowe in k-means but his was not as clean as say Verlander. I think if a pitcher has a huge random cluster, like Lowe’s sinker, it might throw off the k-means some.

rzar.wordpress.com
draysbay.com
raysprospects.com

by RZ on Mar 10, 2010 9:49 AM EST up reply actions  

Is it easier to do 2009 data, vice the earlier stuff

The camera seem calibrated better and may be less variation.

- .-. ..- … – / – …. . / .—. .-. - .. . … …

by Jeff Zimmerman on Mar 10, 2010 10:27 AM EST up reply actions  

RZ -- thanks for doing this

This should keep me busy for a few days.

- .-. ..- … – / – …. . / .—. .-. - .. . … …

by Jeff Zimmerman on Mar 10, 2010 12:10 AM EST reply actions  

Very cool!

Its amazing how many of these pitches are wrongly ID’d. To me, based on the graph, it appears that the ID process takes place in start_speed over pfx_z.

RZ, how do I get these colors to show up when I just want one of these graphs, rather than all 9?

www.bullpenbanter.com
Follow me at http://twitter.com/JDSussman
Follow the site at http://twitter.com/bullpenbanter
Remember: baseball guys... baseball...

by JD Sussman on Mar 10, 2010 12:18 PM EST reply actions  

Easy

plot(query$x,query$y,col=cl$cluster)

Whatever variables you put in a query in R, you can individually identify them by the query name, then a $ sign, and finally the name of the variable you used (in this case x and y). The first variable would be plotted in the x axis and the second one in the y axis. Then keep the col=cl$cluster so that R knows you want the cluster analysis plotted.

rzar.wordpress.com
draysbay.com
raysprospects.com

by RZ on Mar 10, 2010 2:46 PM EST up reply actions  

Would it be more accurate to cluster pitches on a game by game basis because of stadium differences? And if so, is there an easy way to do so without having to write a query for each specific game?

by TheAnswer1313 on Mar 11, 2010 10:05 AM EST reply actions  

if you have the time to cluster game by game

I am sure there is a way to that in a query but I don’t know what command to use for R to just take the parameters for a certain date.

rzar.wordpress.com
draysbay.com
raysprospects.com

by RZ on Mar 11, 2010 11:35 AM EST up reply actions  

Comments For This Post Are Closed


User Tools

We use numbers and stuff.
Community Guidelines
Why be a member?

FanPosts

Community blog posts and discussion.

Recent FanPosts

Img_3830_small
BtBS Fantasy League
Small
Context Neutral Run and RBI projections
Small
Free Agent Compensation
Img_0001_small
Value of Various Plate Approaches
Strike_three2_small
Effect of Foul Area on Strikeouts: AL 1954-68: Erratum
Small
Baseball on a stick
Small
Player Evaluating Statistic
Baseball_small
Rays Outfield: Cheap but Extremely Productive
Small
A new xBABIP
Small
Jack Morris "pitching to the score"

+ New FanPost All FanPosts >

Follow us on Facebook!

Follow us on Twitter!

SaberGraphics

MLB Daily Dish

Get the latest MLB Trade Rumors, Transactions, and News at MLB Daily Dish!


Managing Editor:

Jbopp-kc_small Justin Bopp

Columnists:

Adam_small adarowski

Dme_small Satchel Price

Closeup4_small J-Doug

Carlosicon_small Julian Levine

Billy_and_daddy_4th_of_july_small Bill Petti

Featuring:

Dayton_small Jeff Zimmerman

12475953_small Jacob Peterson

Picture-6_small Chris St. John

Btbpro_small Dave Gershman

229331_10150183361996591_674441590_6760167_6637860_n3_small Lewie Pollis

Img_3830_small David Fung