Advanced Graphing Techniques Part 4 - Cluster Analysis in R

If you have ever worked with the raw pitch f/x data, you would have likely noticed that MLBAM's real time pitch classification isn't always accurate. One can classify pitches on their own using Excel (or any other spreadsheet program) for individual games, but it becomes rather difficult and time consuming when the data is immense, say a starting pitcher's data from the past two seasons.

Most pitchers have pretty defined clusters for each of their pitches on a plot of three pitch f/x variables: pfx_x (horizontal spin movement), pfx_z (vertical spin movement), and starting velocity. This is where R becomes quite useful. It has the power to use cluster analysis for multiple variables. I'm still a novice in R so I will work with k-means clustering which is pretty simple to work with in R.

Here is the step by step process.

1. Follow Part 1 if you don't have a database yet.

2. In Part 2, follow steps 1-7 if you don't have R installed and connected to your database. When R is installed, follow step 8 to connect it to your database in RGui (the interface of R).

3. To use cluster analysis, you need to install the package cluster in RGui. On the top bar, click packages and scroll down to "Install Package(s)..." and click it. Select the appropriate CRAN mirror and when the list of packages comes up select "cluster" to install it. Then either go to packages on the top bar and select load packages and click on cluster or run the command library(cluster).

4. Now for the query. We need to gather data for the three variables mentioned earlier for an individual pitcher. I usually grab it in individual years to remove as much noise as possible or if the pitcher changed his movement/velocity on some of his pitches (rare). And you want to take out pitches that were pitchouts or intentional balls as they are not real pitches.

query <- sqlQuery(channel, "select p.pfx_x, p.pfx_z, p.start_speed from atbats a join pitches p on p.ab_id = a.ab_id join games g on g.game_id = a.game_id join players pl on pl.eliasid = a.pitcher where pl.last='verlander' and year( and p.pfx_x is not null and (pitch_type<>'in' and pitch_type<>'po') order by p.pitch_id asc;")

Keep the ORDER BY part as it makes sure your query is in a particular order (this becomes useful later).

5. Once you have the query you want, time to run the cluster analysis. These are commands to run the analysis and then plot them into R.


The number corresponds to number of clusters you want. In this case, Justin Verlander only has four pitches: fastball, changeup, slider, and curve. Now to plot them use this command.


You should now have a plot that appears like this.



As you can see, all three variables are plotted and there are six nice graphs plotting each against each other. The one on the far left is the x and y plot, the common pfx_x and pfx_z graph (spin movement). Under that plot is the pfx_x by velocity plot which visually brings out any pitch clusters quite clearly. For the sake of learning how to run cluster analysis, I will leave the problems associated with this step at the end of the post and continue on with the process.

6. Now that you have the clusters you want, it is time to put them in a spreadsheet in order to reclassify any missed pitches but more importantly, place them into a MySQL database. The easiest method I found is to create a csv file, then upload that file into Excel.


You can create any file path you want as R will make a text file directly from its software. But this text file only contains a number that corresponds to the cluster R assigned for each pitch. You then need to get the pitch id from your database and some other variables to know which cluster is which in Excel. Using the same query as before, just add the pitch_id in the select portion of the query.

query1 <- sqlQuery(channel, "select p.pitch_id as id, p.pfx_x, p.pfx_z, p.start_speed from atbats a join pitches p on p.ab_id = a.ab_id join games g on g.game_id = a.game_id join players pl on pl.eliasid = a.pitcher where pl.last='verlander' and year( and p.pfx_x is not null and (pitch_type<>'in' and pitch_type<>'po') order by p.pitch_id asc;")

Now create another text file that contains this query.


7. To load up the text files into Excel, click on the data tab and click on From Text in the get external data section. Load up both files individually into the spreadsheet. Click next on the first window but when the second window comes up, click on Comma in the delimiters section. This makes sure that Excel separates data into columns. Once you have organized the data known which pitch each cluster number is, you can put those pitch ids into the database.

8. If you are using the database from Part 1 or Mike Fast's method, there should be a column called my_pitch_type in the pitches table. This is where you will place your pitch classifications. Most MySQL guis should have a function that allows you to change the structure of your tables. In SQLyog, open the tables in your database and right click on any column type and click manage columns.



The default data type for my_pitch_type is tinyint which is for numbers. Changing it to varchar will allow MySQL to input letters into the column. Once you have changed it, you can now input your pitch classifications into the database using the UPDATE function. This is how the function should look like.

UPDATE pitches SET My_pitch_type='FF'

WHERE pitch_id=

OR pitch_id=

Of course you will have multiple pitch ids. To do this, just go into a new sheet in Excel and fill an entire column with OR pitch_id= using the drag and drop method. In the next column, copy and paste the pitch ids from the previous sheet for the pitch type you want.

Then do this for all the pitch types and copy and paste the columns into your MySQL gui and it is now set to run it. Just remember to change the pitch type in the SET function for each pitch type you insert and keep it simple by using the same labels pitch f/x uses in its own your database!

Potential Problems

Going back to step five, which is the most crucial part of the process, this is where you just have to play around in R. Sometimes the clustering doesn't look like you want it to. You can just run the cluster function and plot it again to see if it returns some more desirable results. If a pitch was used with little frequency, it is likely k-means won't register it as an individual cluster. You might have to change the number of clusters k-means finds in order to get the most accurate results.

This part depends highly on what the pitcher throws in the eye of pitch f/x. You might have to cut the sample size to half seasons, or run a individual k-means on pitches that are tightly clustered, for example fastballs and changeups or two and four seam fastballs. Also remember that this cluster analysis doesn't take into account individual ballparks or changes in arm angle. So you may have to cut the sample size by ballpark or date although this rarely should happen/

This isn't a perfect analysis and there may be a better cluster function in R but it does help out a whole lot when trying to id pitches in a large sample size and it produces good results most of the time.

If you have any questions and comments about this post or any of the other parts of this series, please feel free to address it in the comments section.

Trending Discussions