One example of how to use pitchf/x data: a heat map showing the location of pitches to a batter. However, such uses of pitchf/x data can go horribly wrong if you're not careful.
Since 2008 (really 2007), Pitchf/x data has tracked every pitch thrown in the major leagues. As a result, the sabermetric community has learned a great deal about how various pitches and pitchers themselves work by analyzing this data.
Of course, Pitchf/x can easily be used, and in fact is mainly used, to simply look at the "stuff" of a specific pitcher. Many writers in the last few years have used the technology to get a closer look at a pitcher, whether their work was about a knuckleballer, an insane sinkerballer, or even just the writer's favorite pitcher on his team. And most of this work has simply been great stuff, which has taught readers more about the pitcher in question and made them more aware of his particular talents and weaknesses.
That said, many of these works share a few flaws: they make what are essentially common mistakes about the technology and thus give slightly (or sometimes not so slightly) distorted views of the pitcher being analyzed.
In this post, I'm going to explain how those interested in this technology can use it to write about the pitchers they care about, and how they can avoid several of the more common mistakes made by those who work with the data.
How to Use Pitchf/x Technology:
There are two ways of working with the Pitchf/x data. The first is to actually download the data to work with it: this is the best method for really doing pitchf/x analysis, but obviously requires more time and work as well as more knowledge of how to use at least Excel, if not some other statistical program.
If you don't want to do that, but want to work on a pitcher, fear not! Several sites make available pitchf/x data in easily readable forms (graphs, heat maps) and with data nicely presented that will allow you to use the data to analyze the pitcher anyhow.
Let's go over these methods in turn:
Actually Downloading the Data:
The most basic way to use Pitchf/x technology is to download the data and work with it in Microsoft Excel, or preferably some other data analysis program (I use SAS's JMP, which I highly recommend, but well...it's not free and I got it from school). The data at its core is found at http://gd2.mlb.com/components/game/mlb/, in xml files in the inning folder of each game. However, since the data is separated by inning (it used to be by pitcher), it's a real pain to download manually.
Fortunately, there are many alternatives that let you easily obtain the data you want. Joe Lefkowitz's tool lets you download the data of a specific pitcher. Brooks Baseball lets you download a pitcher's data for a specific start to Excel (the data is listed under expanded table data). If you want data for the whole season or more, there are multiple downloads out there on the web that you can use. Alternatively, there are also scripts available around the web that will allow you to download the entire pitchf/x database (I'd link you to my own, but the site I got it from has apparently gone "private," so I'd rather not).
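For those who do want to read the xml files directly, here is a minimal sketch of pulling pitch records out of one inning file. It assumes each pitch is a `pitch` element nested inside an `atbat` element, with attributes named `pitch_type`, `start_speed`, `pfx_x`, and `pfx_z`; the sample string and the player IDs in it are made up for illustration, and real files carry many more attributes.

```python
import xml.etree.ElementTree as ET

# Made-up sample mimicking the ASSUMED layout of one inning file.
SAMPLE_INNING_XML = """
<inning num="1">
  <top>
    <atbat pitcher="111111" batter="222222">
      <pitch pitch_type="FF" start_speed="92.4" pfx_x="-4.1" pfx_z="9.8"/>
      <pitch pitch_type="CU" start_speed="77.0" pfx_x="5.2" pfx_z="-6.3"/>
    </atbat>
  </top>
</inning>
"""

def parse_pitches(xml_text):
    """Return one dict per <pitch> element, with numeric fields as floats."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for atbat in root.iter("atbat"):
        for pitch in atbat.iter("pitch"):
            rows.append({
                "pitcher": atbat.get("pitcher"),
                "pitch_type": pitch.get("pitch_type"),
                "start_speed": float(pitch.get("start_speed")),
                "pfx_x": float(pitch.get("pfx_x")),
                "pfx_z": float(pitch.get("pfx_z")),
            })
    return rows

pitches = parse_pitches(SAMPLE_INNING_XML)
for row in pitches:
    print(row)
```

From there, the rows can be written out to a CSV and opened in whatever spreadsheet or statistics program you prefer.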
Once you have the pitchf/x data downloaded, you can open it up in whatever spreadsheet program you use and use the data to do whatever you want. For an explanation as to how the data is organized into columns, see HERE.
Working with the data as displayed online:
Pitchf/x data is presented in usable forms in various places around the internet.
First of all, there's Fangraphs: Fangraphs carries pitchf/x data for each pitcher under a tab on each pitcher's page. Fangraphs carries average data - horizontal and vertical movements as well as pitch velocity, for every pitch of a pitcher. These numbers need to be taken with a grain of salt, for reasons that will be explained in the first section below (Please DO NOT IGNORE THIS PART). Fangraphs also carries pitchf/x charts for each game of a pitcher as well as graphs for a full season's worth of pitches. Fangraphs is also the only place where locational heat maps can be found as far as I know, which can be useful.
However, fangraphs does not provide a great variety of graphs, so it often doesn't have what you want displayed, and these graphs have been known to be missing from pages of players new to the league. As such, it's not the place I'd first go for the data.
The second place that carries the data, and the place most frequently cited, is Texasleaguers. Texasleaguers is pretty great in general, in that it contains the basic average numbers for pitches, as well as relevant pitch results such as swing rates. BIG NOTE: Texasleaguers uses the term "Whiff Rate" to refer to the percentage of ALL pitches that result in swinging strikes. Texasleaguers also includes some relevant graphs, which are quite nice.
Texasleaguers is a fairly good source, but remember it suffers from the same pitch classification problem as fangraphs.
A third place you can find such data is Joe Lefkowitz's site. On this site, you can find the pitchf/x data, just like on Texasleaguers, and it is very nicely presented. It also includes GB rates, which are for some reason not on Texasleaguers, as well as the most comprehensive graphs of the three sites. BIG NOTE: Lefkowitz's data includes the "swing-miss rates" of pitches. UNLIKE the whiff rates of Texasleaguers, this measures how often batters miss when they swing (Pitches Whiffed On divided by Pitches Swung At).
Once again, the Lefkowitz site suffers from pitch classification problems as discussed below.
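Since the two sites' rates are easy to mix up, here is a small worked example of the difference between them. The function names and the pitch counts are made up for illustration; only the two definitions come from the descriptions above.

```python
def whiff_rate(total_pitches, whiffs):
    """Texasleaguers-style rate: swinging strikes per pitch thrown."""
    return whiffs / total_pitches

def swing_miss_rate(swings, whiffs):
    """Lefkowitz-style rate: swinging strikes per swing."""
    return whiffs / swings

# Hypothetical slider: 400 thrown, 180 swung at, 36 missed.
total_pitches, swings, whiffs = 400, 180, 36
texas_style = whiff_rate(total_pitches, whiffs)
lefkowitz_style = swing_miss_rate(swings, whiffs)
print(f"whiff rate:      {texas_style:.3f}")
print(f"swing-miss rate: {lefkowitz_style:.3f}")
```

Same pitch, same results, but one number is more than double the other, so never compare a figure from one site against a figure from the other.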
A fourth place you can find pitchf/x data is Brooks Baseball. Brooks is notable for being the only one of these sites to update after each batter in a game. Thus if you want to know the pitch composition of a pitcher as the game goes on, you can call up the game on Brooks Baseball and find it out. The basic numbers up top use the same classifications as the other websites, but you can figure out what the pitch composition is in a game just by seeing the excellent graphs.
OKAY, so now you know where to find the data so that you can use it. But before you do so, you need to be aware of some mistakes, common to pitchf/x analyses, that you must avoid when using the data. These mistakes can end up rendering your work on the pitcher in question utterly meaningless.
What Mistakes to Avoid when using the data:
Mike Fast covered the basics of what mistakes not to make when doing a pitchf/x analysis last year in an article for The Hardball Times. It's a great article, and I'd advise you to read it HERE if you're interested in doing the work. I'll be reiterating some of his points, while adding some others.
FIRST: Do NOT Trust the Pitch-Type Classifications!
So, when you get the data, you'll notice that each data point, representing a specific pitch, is marked as being of one of many pitch types. Who decides whether pitches are fastballs, change-ups, sliders, curve-balls, or any other pitch type?
The answer is a computer algorithm designed by MLB Advanced Media (MLBAM), which classifies each pitch shortly after it is thrown (on occasion, pitches are later re-classified after games are over). The algorithm attempts to use the movement of each pitch, along with its speed, to determine what pitch type that pitch really was. To help it with this work, the algorithm is told which pitches one would expect to see from a pitcher, so that it can choose the correct pitch type when pitches are really borderline. As the season goes on, a pitcher's pitch type information often becomes more accurate, as the algorithm is told that the pitcher uses a specific pitch type he may or may not have thrown before, and adjusts its future classifications accordingly.
The problem is that these Pitch Type Classifications are often not great and can sometimes be DOWNRIGHT INACCURATE. The algorithm has gotten better every year, but it still has many flaws. For example, the algorithm has massive issues determining whether a fastball is a four-seam fastball or a two-seamer, and the algorithm for better or for worse has classified more two-seam fastballs over the last year than ever before (I suspect it's getting more accurate, but it's still very eh).
Moreover, since the algorithm is updated during the year to correct for new pitches, the system will often call a pitch one pitch type for part of the year and another for the rest of the year, which is a problem when dealing with fangraphs, texasleaguers, or Joe Lefkowitz' numbers. For an example: Mike Pelfrey's splitter was classified in the first few months as a change-up until the system was told it was a splitter. From that point on, the "CH" designation was replaced near entirely with the splitter designation "FS" for those pitches.
So what can you do about this problem? First, look at the graphs that you get along with Texasleaguers' or Joe Lefkowitz's or even fangraphs' data. You can see that the pitches tend to fall into clusters. If the algorithm is misclassifying pitches, you'll often notice that clusters are of multiple colors (representing multiple pitch types). This can help you recognize which "pitch types" are really the same. Secondly, view with skepticism any pitch types found infrequently by the algorithm. These are probably just errors in the system.
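If you have the pitch-by-pitch labels in hand, the second check is easy to automate. Here is a rough sketch that flags any pitch type making up a tiny share of a pitcher's pitches; the 3% cutoff and the function name are arbitrary choices of mine, not any sort of standard.

```python
from collections import Counter

def rare_pitch_types(labels, min_share=0.03):
    """Return pitch types that make up less than min_share of all pitches."""
    counts = Counter(labels)
    total = len(labels)
    return sorted(t for t, n in counts.items() if n / total < min_share)

# Hypothetical season of labels for one pitcher.
labels = ["FF"] * 60 + ["SL"] * 25 + ["CH"] * 13 + ["FS"] * 2
suspicious = rare_pitch_types(labels)
print(suspicious)
```

A flagged type isn't automatically an error (it could be a new pitch), so treat the list as a prompt to go look at the graphs.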
If you're playing with the data itself, having downloaded it via one of the methods above, you can fully resolve this situation by manually classifying the data (or using K-Means Clustering). You do this by organizing the data into clusters based upon the pfx_x, pfx_z, and start_speed data values.
In all reality, the algorithm does a REALLY REALLY good job at what it's supposed to do: the pitches of individual pitchers tend to be somewhat unique to that pitcher, and thus making a system that classifies these pitches in real time is really really REALLY hard. And the system isn't terrible: for example, in general the system does a fairly good job at identifying curveballs. It also is fairly good at identifying other non-fastball pitches. The algorithm IS NOT very good at distinguishing different fastballs, but overall it can be solid. Just don't treat the numbers as absolute.
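For the curious, here is a bare-bones sketch of the K-Means idea on (pfx_x, pfx_z, start_speed) values. The nine pitches are made up, the starting centers are picked in a simplistic deterministic way, and the three columns are used unscaled; a real analysis would likely use a statistics package, more careful initialization, and some thought about scaling inches against miles per hour.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means; points are (pfx_x, pfx_z, start_speed) tuples."""
    # Simplistic deterministic start: k points spread evenly through the list.
    step = len(points) // k
    centers = [points[i * step] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each pitch to its nearest center.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Move each center to the mean of the pitches assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centers

# Hypothetical pitches: three fastball-like, three slider-like, three curve-like.
points = [
    (-5.0, 9.5, 92.0), (-4.6, 10.1, 93.1), (-5.3, 9.0, 91.5),
    (3.0, 1.0, 84.0), (2.6, 1.5, 83.2), (3.4, 0.4, 84.9),
    (5.5, -6.0, 76.0), (6.0, -5.4, 75.1), (5.1, -6.6, 77.0),
]
labels, centers = kmeans(points, k=3)
print(labels)
```

The cluster numbers mean nothing by themselves; you still have to look at each cluster's average speed and movement and decide what pitch it is.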
Also, as Mike Fast wrote in his article, the numbers tend to change year to year as the algorithm changes...so don't try comparing the numbers based upon classifications from year to year or even month to month at times: look at the graphs in these sources to ensure that things have indeed really changed.
SECOND: Certain Pitch Types are in fact basically the same thing:
This is really a smaller sub-point of our first point: the MLBAM algorithm contains several pitch classifications that really represent the same thing. For example, the algorithm will classify some pitches as "FT," for 2-seam fastball, and "SI," for sinker. A sinker IS a two-seam fastball. What goes on is that the algorithm is told who is a sinkerballer, and will use SI instead of FT for those pitchers. Of course a problem is that sometimes the algorithm will switch from FT to SI mid-season, making it seem like the pitcher has switched pitches. That is NOT the case: please don't be fooled.
Similarly, the classification KC, for Knuckle-Curve, doesn't mean anything different from "CU." If a pitcher has both according to pitchf/x...remember that they are the same pitch.
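If you're working with downloaded data, one simple way to protect yourself is to collapse these equivalent labels before counting anything. A small sketch (the mapping just encodes the two equivalences described above; the function name is my own):

```python
# Labels that, per the discussion above, describe the same pitch.
EQUIVALENT_LABELS = {"SI": "FT", "KC": "CU"}

def merge_equivalent(labels):
    """Replace SI with FT and KC with CU; leave every other label alone."""
    return [EQUIVALENT_LABELS.get(label, label) for label in labels]

# Hypothetical season where the label switched from FT to SI partway through.
season = ["FT", "FT", "CU", "SI", "SI", "KC", "SL"]
merged = merge_equivalent(season)
print(merged)
```

Note that cutters and sliders are deliberately NOT merged here, since they really are different pitches.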
Finally, while Cutters (FC) and Sliders (SL) are NOT the same pitch, the system has difficulty figuring out when a pitch is one rather than the other, since they are rather similar. Check the graphs: often the system will switch from one to the other as it's told what the pitch actually is by some human operator. Don't assume a pitcher throws both (though some pitchers, like Cliff Lee, DO throw both pitches).
THIRD: Do NOT Rely Upon Pitch Type Run Values:
Pitch Type Run Values (often labeled just as run values or displayed as the run value per 100 pitches) are essentially an application of linear weights to individual pitches. Pioneered by Joe Sheehan* at Baseball Analysts, these calculate the number of runs prevented (or created) by each individual pitch that a pitcher throws. The pitch type run value numbers you see in certain places add up the total runs created/prevented by each individual type of pitch that a pitcher throws. The idea of course is to use these values to create a quick measure of how good a pitch is. You can find these values fairly easily on fangraphs under Pitch Values. So according to these values, the two most effective fastballs (per pitch) last year belonged to Tim Hudson (okay, a sinkerballer) and Ted Lilly. A quick note: technically, negative run values are supposed to be GOOD for pitchers and BAD for hitters. Fangraphs, however, reverses this for pitchers (but not for hitters, so positive numbers are good for both pitchers and hitters).
*I'm fairly certain Sheehan was the first to use these for this type of analysis, but if someone else came up with it first, let me know.
However, there are a great many problems with pitch type run values aside from the confusion over the negative sign. For one, these numbers do not try to adjust for luck on balls in play. Thus if a pitcher has a low BABIP, he'll have high run values on his pitches (using the fangraphs convention, where higher is better), while a high BABIP will result in lower run values. This is a problem: in these cases, the run values won't really be telling you how good each pitch type is for a pitcher, but simply how lucky the pitcher has been on those pitches. In other words, any true measure of a pitch's value in these run values is swamped by the luck-driven results on balls in play. The end result is that you get misleading results like Zack Greinke's curveball seeming to be a poor pitch this year, when in reality it wasn't: it just had very bad luck on balls put in play. Run values can also be badly affected if a pitcher has great or poor luck with home runs per fly ball on the pitch.
Now there is a solution for this problem: something called Expected Run Values (often abbreviated as RVe or RVe100 for 100 pitches). These are run values that also use the standard run values for pitches that are not put into play (or hit out of the park). However, for balls put in play, expected run values don't rely on the actual results of these pitches, but instead use the average run value for each batted ball type (ground ball, fly ball, line drive, or pop up). The end result is a run value number that ISN'T ruined by luck-driven batted ball results. However, expected run values are NOT carried by fangraphs or any other major site, making this statistic unavailable for most people who wish to do a pitch analysis. But if you do have access to these values, please use them instead of standard run values.
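To make the idea concrete, here is a rough sketch of the calculation. Every number in it is made up: the per-batted-ball averages are placeholders rather than real linear weights, and the sign convention is the technical one (negative is good for the pitcher). It also skips any special handling of home runs, so treat it as an illustration of the substitution step only.

```python
# Placeholder average run values per batted-ball type (NOT real weights).
AVG_RV_BY_BATTED_BALL = {"GB": -0.05, "FB": 0.03, "LD": 0.30, "PU": -0.25}

def expected_run_value(pitch):
    """Use the actual run value unless the ball was put in play."""
    if pitch["batted_ball"] is None:
        return pitch["run_value"]
    return AVG_RV_BY_BATTED_BALL[pitch["batted_ball"]]

def per_100(values):
    """Scale a list of per-pitch run values to a per-100-pitches rate."""
    return 100 * sum(values) / len(values)

# Hypothetical pitches: actual run value plus batted-ball type (None if not in play).
sample = [
    {"run_value": -0.04, "batted_ball": None},  # called strike
    {"run_value": 0.05, "batted_ball": None},   # ball
    {"run_value": 0.45, "batted_ball": "GB"},   # grounder that found a hole
    {"run_value": -0.27, "batted_ball": "LD"},  # liner hit right at someone
]
rv100 = per_100([p["run_value"] for p in sample])
rve100 = per_100([expected_run_value(p) for p in sample])
print(round(rv100, 2), round(rve100, 2))
```

The lucky grounder and the unlucky liner each get replaced by the average for their batted-ball type, which is the whole point: the expected number stops caring where the ball happened to land.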
Unfortunately, both normal pitch type run values AND expected run values have another problem: they are heavily context dependent. See, the run value of a pitch depends upon when in the count the pitch is used. If a pitcher uses a pitch only on 3-0 counts (a fastball, for instance), its run value cannot possibly be very good. This is because if the pitch is a ball, its run value will be pretty poor, and even if the pitch is a strike, its run value won't be very good at all, because getting from a 3-0 to a 3-1 count isn't really that great. More realistically, some pitchers only use certain pitches on early counts such as 0-0 (you see this a lot with change-ups). Those pitches' run values will be lower than, say, those of a pitch used only on 0-2 or 1-2, as the latter counts will result more frequently in outs. Pitch Type Run Values DO compensate for this somewhat...a hit on an 0-2 count (where an out is likely) is treated as worse for a pitcher than a hit on a 3-0 count (where the batter getting on base is very likely). But it's not enough to adjust for how pitchers use certain pitches.
In addition, run values can mislead people as to the true value of a pitch. Remember, the value of a pitch is NOT independent of the quality of the same pitcher's other pitches. In other words, if a pitcher is an excellent sinkerballer, the pitcher's off-speed pitches may wind up with higher run values per pitch than the sinker. This doesn't mean the pitcher should throw those pitches more often; it simply means that the higher quality and frequency-of-use of the sinker makes batters expect the pitch and makes the other pitches catch batters off guard. For a great example, both Tim Wakefield's and R.A. Dickey's fastballs are frequently rated pretty high by fangraphs' run values. Is this because those pitches, which are, well, kind of slow for fastballs, are really good pitches? Of COURSE NOT: What's really happening here is that batters facing a knuckleballer aren't expecting the fastball, so they let the pitch go by, where it's nearly always a strike. Thus when you look at run values, the fact that a pitcher's less used pitch has a better run value per 100 pitches DOES NOT MEAN that the pitcher is using his pitches inefficiently.*
*In reality, the optimal usage of pitches is for a pitcher to use each of his pitches at a rate such that the total run value of his pitches is at its maximum. Thus IT MAY BE OPTIMAL in some cases for a pitcher to throw a pitch with a POOR RUN VALUE, even when that seems to cost him runs. In those cases, the poor results of that pitch set up the pitcher's other pitches and make them more effective - a trade that could be well worth it.
Okay, I'm cheating with this section here: Pitch Type Run Values on fangraphs - which is where most people get them from - are not based on pitchf/x numbers but are based on data from humans at Baseball Info Solutions. This actually makes these values worse: the BIS guys aren't much better at classifying pitches than the MLBAM algorithm and they DO NOT make distinctions between various types of similar pitches. Thus BIS treats all fastballs the same and thus compares sinkers to high heat 4-seam fastballs. The end result is that these results are even more unreliable for whatever use you want for them.
Some people will cite pitch type run values that they calculated using pitchf/x numbers; unless these are expected run values, the same caveats listed above still apply. Please, guys, be very careful in using these numbers: they can often lead you to very incorrect judgments about pitches.
Instead of using these run values to judge a pitch's value, take a look at the Swinging Strike Rates (Whiff and Swing Rates) and GB Rates of a pitch - they'll give you a better picture of how good a pitch is at getting batters out.
FOURTH: Be Very Careful in How You Use Heat Maps
Heat Maps are a way of displaying the results of pitchf/x data in a way that seems very intuitive and looks really really cool. Essentially a pitchf/x heat map displays the results of a pitcher or batter based upon where a pitch is when it "crosses" home plate. Alternatively, heat maps frequently are used to show how frequently a pitcher throws in each part of the strike zone or where pitchers have located their pitches against a certain batter. An example of such you can see below. It used to be that such heat maps, which are a great graphic to add to a pitcher or batter analysis, were basically reserved for those individuals who knew how to work with the data themselves in various statistical programs.
Not any longer. Now there are several places where people can find heat maps:
Fangraphs just added to their pitchf/x page for each pitcher a tab marked "heat maps." Under this tab, you can adjust settings to view heat maps that show the locations in which a pitcher throws each pitch type most frequently.
Beyond the Boxscore's own Jeff Zimmerman has a website that allows for the creation of run value heat maps. These use the aforementioned run values to make heat maps showing where batters are good at hitting and where they're poor. The site's not public yet, but Jeff intends to make it public soon and is currently allowing some people to beta test it.
TruMedia has a program for its Baseball Analytics site that is used on that site, by certain bloggers who have been given access to it, and by ESPN blogger Mark Simon. This program has created a variety of heat maps showing how well hitters or pitchers do in certain areas of the zone. As far as I know, it is the most versatile program for creating heat maps, having produced ones showing batter/pitcher swing rates, slugging, on base percentage, and more based upon the area of the strike zone hit by a pitch. This program is not free to the public, but if you're one of those who have access to it, you should still know the limitations of the technology.
The problem is that heat maps, while they look really nice, can be extremely misleading and are often completely unpredictive as a measure (making them near worthless at times). First of all, heat maps from these programs USE MLBAM CLASSIFICATIONS. As mentioned above, these can be extremely poor, and bad pitch classification can leave a heat map completely useless.
Moreover, there's more often than not a sample size problem. Heat Maps showing just the locations of where pitches are when they cross the plate are for the most part fine (as the sample size is what's being reported). However, where you have a small sample size for these heat maps, you're often better off just using a scatter plot showing where the individual pitches have been located.
But other types of heat maps, showing the results of pitches by location, are more problematic. Take a heat map of the swinging strike rate of a batter based upon where he faces pitches. Essentially, what such a heat map does is take the results in certain areas (often referred to as "bins" or "buckets") and then smooth out the results to form a nice looking heat map. The problem, however, is that when multiple areas have very small sample sizes, you end up with large parts of a heat map based upon samples way too small to mean anything. This is true even if your overall sample size is large enough to make judgments about the player in general.
As an example, I've put together two heat maps below based upon the first 630 pitches faced by Mets 1B Ike Davis last year, through his first month and a half. These heat maps show how frequently Ike Davis swings at each area in the strike zone. The heat map on the left shows in general the swing rate of Ike Davis based upon the location of a pitch. As you can see, it looks like Ike swings really only if a pitch is middle in with various exceptions:
But the heat map on the right shows the problem here. See, how many pitches would you really think you need in an area to tell how often a batter is likely to swing at pitches in that area? Definitely more than five, right? So for the heat map on the right, I removed all the data from areas/bins/buckets that didn't have at least 5 pitches worth of data. The end result is that we lose most of our heat map, including NEARLY the entire INSIDE part of the strike zone, which was a big hot spot in the first heat map! See, pitchers throw to certain parts of the strike zone much more frequently than others, so batters see very few pitches in some areas, especially if you limit your heat maps to specific pitch types (which have smaller sample sizes).
These same sample size issues affect other types of heat maps as well, especially those showing data from out of the strike zone, where pitchers throw less frequently (obviously).
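The filtering step I used for the right-hand map can be sketched roughly like this. The column names `px` and `pz` (horizontal and vertical location at the plate, in feet), the half-foot bin size, and the sample pitches are all assumptions for illustration; only the "at least 5 pitches per bin" rule comes from the description above.

```python
import math
from collections import defaultdict

def swing_rate_by_bin(pitches, bin_size=0.5, min_pitches=5):
    """pitches: iterable of (px, pz, swung). Returns {bin: swing rate or None}."""
    counts = defaultdict(lambda: [0, 0])  # bin -> [swings, pitches seen]
    for px, pz, swung in pitches:
        key = (math.floor(px / bin_size), math.floor(pz / bin_size))
        counts[key][0] += int(swung)
        counts[key][1] += 1
    # Blank out any bin with too few pitches to say anything about.
    return {
        key: (swings / seen if seen >= min_pitches else None)
        for key, (swings, seen) in counts.items()
    }

# Hypothetical pitches: six in one bin (3 swings), two in another (2 swings).
sample_pitches = [
    (0.1, 2.1, True), (0.2, 2.2, False), (0.3, 2.3, True),
    (0.4, 2.4, False), (0.1, 2.3, True), (0.2, 2.1, False),
    (-0.9, 2.2, True), (-0.8, 2.3, True),
]
rates = swing_rate_by_bin(sample_pitches)
print(rates)
```

Without the filter, the two-pitch bin would show up as a blazing 100% swing rate hot spot, which is exactly the kind of thing that makes an unfiltered heat map look more informative than it is.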
Does this mean that heat maps are useless? Certainly not! Location heat maps are nice and useful for showing locations for larger sample sizes, where scatterplots are hard to read due to lots of data clumped together. Other types of heat maps are really useful for figuring out trends among more than one batter or pitcher, as the sample size problem diminishes.
But using a heat map to show the strengths and weaknesses of an individual batter or pitcher is potentially problematic due to the problem of sample size. You can get around some of this problem with regression, but it's still there nonetheless.
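One simple form of that regression is to pull each bin's rate toward the player's overall rate, with small bins pulled the hardest. The sketch below does this by adding a fixed number of "phantom" pitches at the overall rate; the weight of 20 phantom pitches and the sample numbers are arbitrary assumptions of mine, not a recommended value.

```python
def regressed_rate(successes, pitches, overall_rate, prior_weight=20):
    """Shrink a bin's raw rate toward overall_rate; small bins shrink the most."""
    return (successes + prior_weight * overall_rate) / (pitches + prior_weight)

overall = 0.45  # hypothetical overall swing rate for the batter
tiny_bin = regressed_rate(2, 2, overall)      # raw rate 100% on 2 pitches
big_bin = regressed_rate(120, 200, overall)   # raw rate 60% on 200 pitches
print(round(tiny_bin, 3), round(big_bin, 3))
```

The two-pitch bin ends up close to the overall rate while the 200-pitch bin barely moves, which is the behavior you want. It still doesn't create information that isn't there, though.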
Pitchf/x data can teach us a ton about the game of baseball. And now, it's particularly accessible for really anyone to use. But in using such data, you have to be careful not to make the common mistakes mentioned in this article or in the Mike Fast article listed above. They're quite easy errors to avoid; so just please, if you use the data: don't make these errors.