I recently published an article using heat maps over at Fangraphs showing how batters' strike zones differ, and a question came up about Brett Gardner vs. LHP. In particular, the question asked about a single pitch way inside that Brett swung at, as seen in this image:
It looks like he swings at stuff way inside, but it was just a one-time deal.
The best method I could think of to deal with this problem is to regress the data toward the league average for each area. This would smooth out extreme, out-of-place values on the heat maps. I have found that I will need to add between 20 and 30 league-average-weighted pitches to properly regress the data.
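To make the idea concrete, here is a minimal sketch of what "adding 20-30 league-average pitches" to a bin looks like. The function name and numbers are hypothetical (not from the actual application); it just shows the phantom-pitch shrinkage arithmetic.

```python
# Sketch of regressing a per-bin swing rate toward the league average
# by adding n_phantom "phantom" pitches that swing at the league rate.
# Names and numbers are illustrative, not from the real application.

def regressed_rate(bin_swings, bin_pitches, league_rate, n_phantom=25):
    """Shrink a bin's raw swing rate toward league_rate.

    Equivalent to mixing in n_phantom league-average pitches
    before computing the ratio, so sparse bins get pulled hard
    toward the league number while well-populated bins barely move.
    """
    return (bin_swings + n_phantom * league_rate) / (bin_pitches + n_phantom)

# A bin like the Gardner one, where he swung at 1 of 1 pitches way
# inside: the raw rate is 1.0, but with 25 phantom pitches at a 10%
# league swing rate it shrinks to about 0.13.
print(round(regressed_rate(1, 1, 0.10, 25), 3))   # 0.135

# A bin with plenty of real pitches moves much less:
print(round(regressed_rate(50, 100, 0.10, 25), 3))  # 0.42 vs. raw 0.50
```

This is the same "add N league-average trials" shrinkage used for regressing small-sample rate stats; the only modeling choice is N, which the text pegs at 20-30.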
With this information, what do you think is the correct way to regress the data once other variables are involved? For example, what if I want to look at how one player swung on 0-2 counts during 2010? Should I use the league-average data for all counts since 2007, or just 0-2 counts in 2010? I think recomputing the baseline for each scenario would be ideal, but then I run into another problem.
Currently, creating a heat map takes about 3 seconds over the internet. Computing the regression data on the fly would add anywhere from a few more seconds up to 15+ minutes per map. Also, once a second person starts a request, the heat maps take twice as long to produce for both people. If I pre-program a set of values for all processes to regress the output toward, the heat map is created in just seconds. This offseason, I plan on making this application available to the public (some people already have access to it), and I am wondering whether people would prefer a faster application or a more correct image.
Right now, I am thinking of doing a single adjustment for each count and ignoring the dates. Does this seem like a reasonable middle ground, or should I be more or less stringent with the data?
Let me know if you need more information or need any ideas cleared up. Thanks -Jeff