As I commented in an article about seven months ago, people like comparisons in baseball. We like to be able to compare an up-and-coming player to an established veteran, and -- if possible -- somehow quantify that comparison.
This was the idea behind Pitcher Similarity Scores. The idea was to compare pitchers based on the pitches they throw, not on their results, and in comparing, we got a score bounded between [0,1] to describe the similarity. I won't rehash the whole formula for the scores -- you can read about the details here -- but the general idea was to look at the velocity, break, arm slot angle and pitch location to compare pitches.
The scores generated many comments and suggestions for improvement, including some ideas that were originally considered and left out for various reasons. Two comments in particular struck me as the natural next steps in improving these similarity scores. In addition to explaining these improvements, I will include the most similar pitchers of 2013, as well as the most similar individual pitches of 2013.
Improving Similarity Scores: Lefties and Righties
The first of the two comments was that some found it interesting that left-handed pitchers were being compared to right-handed pitchers. A high similarity score between these lefties and righties implied that the two pitchers were mirror images of each other.
In calculating these similarity scores, the thought process was that of a batter who is facing a pitcher for the first time. He has never seen this pitcher's arsenal, but is told he has a fastball like Pitcher X, and a changeup like Pitcher Y. He then can reach back to his experiences against Pitchers X and Y and have a general idea of what to expect. Obviously, if Pitcher X is not the same handedness as the new pitcher, it makes no practical sense for the batter to compare the two. So, for this (and future) incarnations of these similarity scores, right-handed and left-handed pitchers will not be as highly comparable to each other.
Improving Similarity Scores: Pitch Sequencing
The most common suggestion to improve the similarity scores was to include pitch sequencing somehow in the process. Originally, sequencing was included in the formula, but its inclusion caused some slight complications, so it was removed.
The main difficulty in pitch sequencing is dealing with incomplete sequences within an at bat. For example, are the following two sequences the same: FA-CU-FA and FA-CU? What if the pitcher intended to throw a fastball as his third pitch in the second sequence? Are they the same in that case?
In order to compare the sequences, we have to ask, "What's the shortest sequence we could possibly see?" That would be a one-pitch at bat. However, if you say that the first pitch is preceded by "nothing" you can even call a one-pitch at bat a sequence of length 2. So, since we can have at minimum a sequence of length 2, that's what we'll look at: All sequences of length 2 within an at bat. So, for example, say we have an at bat with the following sequence of pitches: FA-CU-FA-CH. In this case, there are 4 sequences of length 2: O-FA, FA-CU, CU-FA, FA-CH, where the entry "O" corresponds to the "nothing" that precedes the first pitch.
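The splitting step described above can be sketched in a few lines of Python. This is only an illustration, not the code used to build the scores; the function name `pitch_pairs` and the `START` constant are made up for this example.

```python
from typing import List, Tuple

START = "O"  # stands for the "nothing" that precedes the first pitch


def pitch_pairs(at_bat: List[str]) -> List[Tuple[str, str]]:
    """Return every length-2 sequence in one at bat, including the O pair."""
    padded = [START] + list(at_bat)
    # Pair each entry with the one that follows it.
    return list(zip(padded, padded[1:]))


print(pitch_pairs(["FA", "CU", "FA", "CH"]))
# [('O', 'FA'), ('FA', 'CU'), ('CU', 'FA'), ('FA', 'CH')]
```

A one-pitch at bat such as `["FA"]` yields the single pair `('O', 'FA')`, which matches the idea that even the shortest at bat is a sequence of length 2.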
Now we need to get these sequences in an appropriate form so that we can work them into the similarity score. To begin with, we'll put these sequences in a contingency table. For instance, consider the table below for a pitcher who throws only fastballs and changeups. In this table, the rows are the first pitch in the two-pitch sequence, while the columns are the second pitch of the sequence.
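As a rough sketch of how such a table could be tallied, the Python below counts two-pitch sequences for a fastball/changeup pitcher. The at bats are invented for illustration and are not the data behind the table in this article; the helper name `contingency_table` and the choice to keep "O" as a row label are likewise assumptions.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def contingency_table(
    pairs: Sequence[Tuple[str, str]],
    row_labels: Sequence[str],
    col_labels: Sequence[str],
) -> List[List[int]]:
    """Count pairs: rows are the first pitch, columns are the second pitch."""
    counts = Counter(pairs)
    return [[counts[(r, c)] for c in col_labels] for r in row_labels]


# Hypothetical at bats, made up for this example.
at_bats = [["FA", "CH"], ["FA", "FA", "CH"], ["CH", "FA"]]

pairs = []
for at_bat in at_bats:
    padded = ["O"] + at_bat  # "O" marks the start of the at bat
    pairs.extend(zip(padded, padded[1:]))

table = contingency_table(pairs, row_labels=["O", "FA", "CH"], col_labels=["FA", "CH"])
for label, row in zip(["O", "FA", "CH"], table):
    print(label, row)
# O [2, 1]
# FA [1, 2]
# CH [1, 0]
```

Keeping the row and column labels explicit means two pitchers' tables line up cell for cell, even when one of them never throws a particular sequence.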
Before we continue, we need to remember two things: first, that pitchers throw different numbers of total pitches, and second, that they throw different numbers of each individual pitch. To take this into account, we need to adjust for the expected number of sequences seen based on the number of individual pitches. To do this, we'll assume independence, so that the expected number of sequences $E_{i,j}$ is
$$E_{i,j} = \frac{\left(\sum_{k} O_{k,j}\right)\left(\sum_{l} O_{i,l}\right)}{\sum_{k,l} O_{k,l}}$$
Here, $O_{i,j}$ is the observed table that we saw above, so each expected count is the column total times the row total, divided by the grand total. The expected table for that table would be
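To make the independence formula concrete, here is a minimal Python sketch that turns an observed table into expected counts. The observed counts are invented for illustration and are not the table from this article; `expected_counts` is a hypothetical helper name.

```python
from typing import List, Sequence


def expected_counts(observed: Sequence[Sequence[float]]) -> List[List[float]]:
    """Expected counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    if total == 0:
        raise ValueError("observed table has no sequences")
    return [[r * c / total for c in col_totals] for r in row_totals]


# Made-up observed counts: rows are the first pitch, columns the second pitch.
observed = [[2, 1], [1, 2], [1, 0]]
for row in expected_counts(observed):
    print([round(x, 3) for x in row])
```

By construction, the expected table has the same row totals, column totals and grand total as the observed one, which is what lets it account for how often each pitch is thrown.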
From here, we'll look at a scaled form of the residuals from this expected table. This table is denoted Ri,j and is calculated
For the above tables, we get a scaled residual table R of
Finally, to compare two pitchers, we'll take their scaled residual matrices $R^{(1)}$ and $R^{(2)}$, subtract one from the other entry by entry, sum up the absolute differences, and divide by two. Or, in math notation,
$$D = \frac{1}{2}\sum_{i,j}\left|R^{(1)}_{i,j} - R^{(2)}_{i,j}\right|$$
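The comparison step might be sketched in Python as follows. The two residual matrices are placeholder values chosen only to show the arithmetic, since how the scaled residuals are computed is not reproduced here; `sequencing_distance` is a hypothetical name, and the sketch assumes both matrices use the same row and column ordering.

```python
from typing import Sequence


def sequencing_distance(
    r1: Sequence[Sequence[float]], r2: Sequence[Sequence[float]]
) -> float:
    """Half the sum of absolute entry-wise differences between two matrices."""
    if len(r1) != len(r2) or any(len(a) != len(b) for a, b in zip(r1, r2)):
        raise ValueError("matrices must have the same shape")
    total = sum(
        abs(a - b) for row1, row2 in zip(r1, r2) for a, b in zip(row1, row2)
    )
    return total / 2


# Placeholder scaled residuals for two pitchers (not real data).
r_pitcher_1 = [[0.2, -0.1], [-0.1, 0.0]]
r_pitcher_2 = [[0.1, 0.0], [0.0, -0.1]]
print(sequencing_distance(r_pitcher_1, r_pitcher_2))  # approximately 0.2
```

Two pitchers with identical scaled residuals get a distance of 0, and the value grows as their sequencing habits diverge.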
The distance $D$ is bounded between 0 and 1, which makes it easy to combine with the other components of the similarity scores. However, we need to re-weight the various components of the similarity scores to do this. After the inclusion of sequencing, the weights are
The Most Similar Pitches of 2013
So, now that we have a new version of the similarity scores, we can recalculate them for the pitchers in 2013. For 2012, only the overall similarity scores of pitchers based on their entire arsenals were included. Here, however, the most similar pitchers for each individual pitch will also be given. This is the main merit of the similarity scores. When combining across pitches, comparisons can be a bit muddier. However, this muddiness is removed when looking at one pitch at a time.
From these individual pitch comparisons, you can "create" a Franken-pitcher profile for a pitcher by looking at his most similar comparisons. For example, Mets phenom Matt Harvey has a four-seam fastball most similar to Stephen Strasburg's, a curveball similar to Grant Balfour's (although not strongly similar), and a slider similar to LaTroy Hawkins's (again, only somewhat similar). You can download the entire similarity scores matrix for each of the seven most common pitches, but we will list the most similar pitchers for each pitch below.
Most Similar Pitchers of 2013
Of course, in addition to the Franken-pitcher approach, we can look at a pitcher's arsenal as a whole. This is explained in the original article on similarity scores, and the method is no different than before. So, without further ado, 2013's most similar pitchers are -- envelope please -- Ervin Santana and Juan Nicasio.
Now, just because two pitchers use similar pitches does not imply that they'll have the same results. There are, of course, still many aspects of pitching that require explanation beyond similar arsenals before we can get at the heart of why one pitcher is successful and another falls flat.
. . .
PITCHF/x data courtesy of Baseball Heat Maps.