As I commented in an article about seven months ago, people like comparisons in baseball. We like to be able to compare an up-and-coming player to an established veteran and, if possible, somehow quantify that comparison.
This was the idea behind Pitcher Similarity Scores: compare pitchers based on the pitches they throw, not on their results, and summarize the comparison as a score bounded in [0,1]. I won't rehash the whole formula for the scores (you can read about the details here), but the general idea was to look at the velocity, break, arm slot angle and pitch location to compare pitches.
The scores generated many comments and suggestions for improvement, including some ideas that were originally considered and left out for various reasons. Two comments in particular struck me as the most important next steps in improving these similarity scores. In addition to explaining these improvements, I will include the most similar pitchers of 2013, as well as the most similar individual pitches of 2013.
Improving Similarity Scores: Lefties and Righties
The first of the two comments was that some found it interesting that left-handed pitchers were being compared to right-handed pitchers. A high similarity score between a lefty and a righty implied that the two pitchers were mirror images of each other.
In calculating these similarity scores, the thought process was that of a batter who is facing a pitcher for the first time. He has never seen this pitcher's arsenal, but is told he has a fastball like Pitcher X and a changeup like Pitcher Y. He can then reach back to his experiences against Pitchers X and Y and have a general idea of what to expect. Obviously, if Pitcher X does not throw with the same hand as the new pitcher, it makes no practical sense for the batter to compare the two. So, for this (and any future) incarnation of these similarity scores, right-handed and left-handed pitchers will not be as highly comparable to each other.
Improving Similarity Scores: Pitch Sequencing
The most common suggestion to improve the similarity scores was to include pitch sequencing somehow in the process. Pitch sequencing was originally included in the formula, but its inclusion caused some slight complications, so it was removed.
The main difficulty in pitch sequencing is dealing with incomplete sequences within an at bat. For example, are the following two sequences the same: FA-CU-FA and FA-CU? What if the pitcher intended to throw a fastball as his third pitch in the second sequence? Are they the same in that case?
In order to compare the sequences, we have to ask, "What's the shortest sequence we could possibly see?" That would be a one-pitch at bat. However, if you say that the first pitch is preceded by "nothing," you can even call a one-pitch at bat a sequence of length 2. So, since we can have at minimum a sequence of length 2, that's what we'll look at: all sequences of length 2 within an at bat. For example, say we have an at bat with the following sequence of pitches: FA-CU-FA-CH. In this case, there are 4 sequences of length 2: O-FA, FA-CU, CU-FA, FA-CH, where the entry "O" corresponds to the "nothing" that precedes the first pitch.
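As a rough illustration of the pairing rule just described, here is a minimal Python sketch. The function name `two_pitch_sequences` and the list-of-strings representation of an at bat are assumptions made for this example; only the "O" placeholder and the pitch codes come from the article.

```python
def two_pitch_sequences(at_bat, start="O"):
    """Return every consecutive pair of pitches in one at bat.

    `start` is the placeholder for the "nothing" that precedes the
    first pitch, so even a one-pitch at bat yields one pair.
    """
    padded = [start] + list(at_bat)
    # Pair each pitch with the pitch (or placeholder) thrown just before it.
    return list(zip(padded, padded[1:]))


# The FA-CU-FA-CH at bat from the text.
pairs = two_pitch_sequences(["FA", "CU", "FA", "CH"])
print(pairs)
# [('O', 'FA'), ('FA', 'CU'), ('CU', 'FA'), ('FA', 'CH')]
```

Because every at bat is reduced to pairs, an at bat that ends early simply contributes fewer pairs, which sidesteps the incomplete-sequence question.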
Now we need to get these sequences into an appropriate form so that we can work them into the similarity score. To begin with, we'll put these sequences in a contingency table. For example, consider the table below for a pitcher who throws only fastballs and changeups. The rows are the first pitch in the two-pitch sequence, while the columns are the second pitch of the sequence.
First \ Second  FA  CH 

O  70  30 
FA  35  70 
CH  40  15 
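A table like the one above could be tallied from raw at bats along these lines. This is only a sketch: the function name `sequence_table` and the three sample at bats are hypothetical, chosen to keep the counts easy to check by hand, and are not the data behind the article's table.

```python
from collections import Counter


def sequence_table(at_bats, start="O"):
    """Count (first pitch, second pitch) pairs across many at bats."""
    counts = Counter()
    for at_bat in at_bats:
        padded = [start] + list(at_bat)
        counts.update(zip(padded, padded[1:]))
    return counts


# Hypothetical at bats for a fastball/changeup pitcher.
at_bats = [["FA", "CH"], ["FA", "FA", "CH"], ["CH", "FA"]]
table = sequence_table(at_bats)

# Print in the article's layout: rows are the first pitch, columns the second.
for first in ["O", "FA", "CH"]:
    print(first, [table[(first, second)] for second in ["FA", "CH"]])
# O [2, 1]
# FA [1, 2]
# CH [1, 0]
```

A `Counter` returns 0 for pairs that never occurred, so empty cells such as CH-CH need no special handling.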
Before we continue, we need to remember two things: first, that pitchers throw different numbers of total pitches, and second, that they throw different numbers of each individual pitch. To take this into account, we need to adjust for the expected number of sequences seen based on the number of individual pitches. To do this, we'll assume independence, so that the expected number of sequences E_{i,j} is
E_{i,j}=(∑_{i} O_{i,j})(∑_{j} O_{i,j})/(∑_{i,j} O_{i,j})
Here O_{i,j} is the observed count in row i and column j of the table above. The expected table for that observed table would be
First \ Second  FA  CH 

O  55.8  44.2 
FA  58.6  46.4 
CH  30.7  24.3 
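The expected counts can be checked with a few lines of Python. This is a sketch using only the observed counts given in the article; the nested-dictionary layout and the name `expected_counts` are assumptions of the example.

```python
# Observed counts from the article: outer key is the first pitch,
# inner key is the second pitch.
observed = {
    "O": {"FA": 70, "CH": 30},
    "FA": {"FA": 35, "CH": 70},
    "CH": {"FA": 40, "CH": 15},
}


def expected_counts(observed):
    """Expected counts under independence: row total * column total / grand total."""
    row_totals = {first: sum(cols.values()) for first, cols in observed.items()}
    col_totals = {}
    for cols in observed.values():
        for second, count in cols.items():
            col_totals[second] = col_totals.get(second, 0) + count
    total = sum(row_totals.values())
    return {
        first: {second: row_totals[first] * col_totals[second] / total
                for second in col_totals}
        for first in observed
    }


expected = expected_counts(observed)
for first, cols in expected.items():
    print(first, {second: round(value, 1) for second, value in cols.items()})
# O {'FA': 55.8, 'CH': 44.2}
# FA {'FA': 58.6, 'CH': 46.4}
# CH {'FA': 30.7, 'CH': 24.3}
```

For instance, the O-FA cell is 100 × 145 / 260 ≈ 55.8: 100 first pitches of an at bat, 145 fastballs as the second pitch of a pair, and 260 pairs in all.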
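From here, we'll look at a scaled form of the residuals from this expected table. This table is denoted R_{i,j} and is calculated as
R_{i,j} = (O_{i,j} - E_{i,j}) / (∑_{i,j} |O_{i,j} - E_{i,j}|)
Dividing by the total absolute residual means the absolute values of the entries of R sum to 1, no matter how many pitches the pitcher threw.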
For the above tables, we get a scaled residual table R of
First \ Second  FA  CH 

O  0.15  -0.15 
FA  -0.25  0.25 
CH  0.10  -0.10 
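The scaled residuals can be reproduced from the observed counts alone. The sketch below assumes the residual is observed minus expected, divided by the sum of absolute residuals, which is the reading that matches the article's numbers; the function name and the list-of-rows layout are choices made for this example.

```python
def scaled_residuals(observed):
    """Scale (observed - expected) so the absolute entries sum to 1.

    `observed` is a list of rows of counts; expected counts assume the
    first and second pitch of a sequence are independent.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    residuals = [
        [obs - r_tot * c_tot / total for obs, c_tot in zip(row, col_totals)]
        for row, r_tot in zip(observed, row_totals)
    ]
    scale = sum(abs(x) for row in residuals for x in row)
    if scale == 0:
        # Observed counts match independence exactly; returning zeros is
        # this sketch's own choice, not something the article specifies.
        return [[0.0 for _ in row] for row in residuals]
    return [[x / scale for x in row] for row in residuals]


# Rows: O, FA, CH; columns: FA, CH (the observed table from the article).
observed = [[70, 30], [35, 70], [40, 15]]
R = scaled_residuals(observed)
for label, row in zip(["O", "FA", "CH"], R):
    print(label, [round(x, 2) for x in row])
# O [0.15, -0.15]
# FA [-0.25, 0.25]
# CH [0.1, -0.1]
```

A positive entry means the pitcher throws that pair more often than his overall pitch mix would predict, and a negative entry means less often.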
Finally, to compare the two pitchers, we'll take the two scaled residual matrices R^{1} and R^{2}, subtract one from the other, sum up the absolute differences, and divide by two. Or, in math notation,
D = (∑_{i,j} |R^{1}_{i,j} - R^{2}_{i,j}|) / 2
This quantity D is bounded in [0,1], with 0 meaning the two pitchers have identical scaled residuals, which makes it easy to combine with the other components of the similarity scores. However, we need to reweight the various components of the similarity scores to do this. After the inclusion of sequencing, the weights are
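To make the comparison step concrete, here is a small sketch of the distance between two scaled residual tables. The first table is the article's example pitcher (rounded, with signs); the second pitcher's table is hypothetical and exists only to give the function something to compare against. The name `sequence_distance` is also an assumption of the example.

```python
def sequence_distance(r1, r2):
    """Half the summed absolute difference of two scaled residual tables."""
    if len(r1) != len(r2) or any(len(a) != len(b) for a, b in zip(r1, r2)):
        raise ValueError("tables must have the same shape")
    return sum(
        abs(a - b)
        for row1, row2 in zip(r1, r2)
        for a, b in zip(row1, row2)
    ) / 2


# Rows: O, FA, CH; columns: FA, CH.
pitcher_a = [[0.15, -0.15], [-0.25, 0.25], [0.10, -0.10]]  # article's example
pitcher_b = [[0.05, -0.05], [-0.20, 0.20], [0.25, -0.25]]  # hypothetical

print(round(sequence_distance(pitcher_a, pitcher_b), 3))
# 0.3
```

Since the absolute entries of each table sum to 1, the summed absolute difference can be at most 2, which is why dividing by two keeps the result in [0,1]. Two pitchers with exactly opposite tendencies would score 1.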
Component  Weight 

Horizontal Break  0.2 
Vertical Break  0.2 
Velocity  0.2 
Pitch Sequence  0.2 
Angle  0.1 
Pitch Location  0.1 
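One way the weights might be applied is as a weighted sum of per-component scores. The sketch below rests on assumptions that this article does not confirm: that the components are combined by a plain weighted sum, that each component is already a similarity in [0,1], and that the sequencing distance D is converted to a similarity as 1 - D. The component values are made up for illustration; only the weights come from the table above.

```python
# Weights from the article's table.
WEIGHTS = {
    "horizontal_break": 0.2,
    "vertical_break": 0.2,
    "velocity": 0.2,
    "pitch_sequence": 0.2,
    "angle": 0.1,
    "pitch_location": 0.1,
}


def combined_score(components, weights=WEIGHTS):
    """Weighted sum of component similarities (an assumed combination rule)."""
    if set(components) != set(weights):
        raise ValueError("components must match the weight names")
    return sum(weights[name] * components[name] for name in weights)


# Hypothetical component similarities for one pair of pitchers.
sequence_distance_d = 0.3
components = {
    "horizontal_break": 0.9,
    "vertical_break": 0.8,
    "velocity": 0.95,
    "pitch_sequence": 1 - sequence_distance_d,  # assumption: similarity = 1 - D
    "angle": 0.7,
    "pitch_location": 0.6,
}
print(round(combined_score(components), 3))
```

Because the weights sum to 1, a weighted sum of components in [0,1] stays in [0,1], which is consistent with the bounded overall score described earlier.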
The Most Similar Pitches of 2013
So, now that we have a new version of the similarity scores, we can recalculate them for the pitchers in 2013. For 2012, only the overall similarity scores of pitchers based on their entire arsenals were included. Here, however, the most similar pitchers for each individual pitch will also be given. This is the main merit of the similarity scores. When combining across pitches, comparisons can be a bit muddier. However, this muddiness is removed when looking at one pitch at a time.
From these individual pitch comparisons, you can "create" a Frankenpitcher profile for a pitcher by looking at his most similar comparisons. For example, Mets phenom Matt Harvey has a four-seam fastball most similar to Stephen Strasburg's, a curveball similar to Grant Balfour's (although not strongly similar), and a slider similar to LaTroy Hawkins' (again, only somewhat similar). You can download the entire similarity score matrix for each of the seven most common pitches below; first, here is the most similar pair for each pitch.
Pitch  Pitcher 1  Pitcher 2  Similarity Score 

Four-seam Fastball  Bud Norris  Joe Nathan  0.9839 
Two-seam Fastball  Jorge De La Rosa  Tom Gorzelanny  0.9740 
Cut Fastball  Jake Peavy  Kyle Kendrick  0.9804 
Changeup  Jose Quintana  Martin Perez  0.9737 
Curveball  Adam Warren  Jim Johnson  0.9737 
Sinker  Hiroki Kuroda  Mike Pelfrey  0.9659 
Slider  Nathan Eovaldi  Esmil Rogers  0.9753 
Full Matrix of Four-seam Fastball Comparisons
Full Matrix of Two-seam Fastball Comparisons
Full Matrix of Cut Fastball Comparisons
Full Matrix of Changeup Comparisons
Full Matrix of Curveball Comparisons
Full Matrix of Sinker Comparisons
Full Matrix of Slider Comparisons
Most Similar Pitchers of 2013
Of course, in addition to the Frankenpitcher approach, we can look at a pitcher's arsenal as a whole. This is explained in the original article on similarity scores, and the method is no different than before. So, without further ado, 2013's most similar pitchers are (envelope, please) Ervin Santana and Juan Nicasio.
Now, just because two pitchers use similar pitches does not imply that they'll have the same results. There are of course still many aspects of pitching that require explanation beyond similar arsenals before we can get at the heart of why one pitcher is successful and another falls flat.
Full Matrix of Pitcher Similarity Scores
. . .
PITCHF/x data courtesy of Baseball Heat Maps.
Stephen Loftus is a featured writer at Beyond The Box Score. You can follow him on Twitter at @stephen__loftus.