# PitchFX, Dirt, and Parks

For the last two years, I've published rankings of how successful catchers were at blocking balls in the dirt.  I've been leveraging the Pitch FX data from MLB for this analysis, but I haven't really used the full power of the technology.  Until now, I've relied on the Gameday stringers to classify whether a pitch was in the dirt.

Harry suggested that I look beyond the human element and use the more detailed pitch location information to determine when a pitch would hit the dirt.  Luckily for me, he was kind enough to provide a formula for the point at which the pitch would hit the ground.  After going back and forth on it for a little while, and confirming with some other people, we decided that any pitch that landed no more than 3 feet behind the front of home plate could be considered a ball in the dirt.[1]

In 2008, comparing the scorekeepers to the computer system led to the following difference:

| Stringers | Pitch FX |
| --- | --- |
| 13332 | 23147 |

So the stringers identified only about 58% as many pitches in the dirt as the Pitch FX system did (13332 / 23147).  I grew curious about such a large discrepancy (which only got larger if we moved the catcher's location back to -3.5 or -4 feet).

My first thought was that the scorers in certain parks had a tendency to report fewer balls in the dirt than their counterparts in other parks.  This table breaks down the identified balls in the dirt by park and shows the stringers' count as a fraction of the Pitch FX count.[2]

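| Park | Stringers | Pitch FX | Stringers / Pitch FX |
| --- | --- | --- | --- |
| ANA | 474 | 969.48 | 0.49 |
| ARI | 402 | 915.43 | 0.44 |
| ATL | 339 | 797.2 | 0.43 |
| BAL | 471 | 861.22 | 0.55 |
| BOS | 456 | 638.28 | 0.71 |
| CHA | 399 | 681.51 | 0.59 |
| CHN | 440 | 791.81 | 0.56 |
| CIN | 487 | 903.05 | 0.54 |
| CLE | 401 | 735.38 | 0.55 |
| COL | 394 | 734.94 | 0.54 |
| DET | 424 | 884.92 | 0.48 |
| FLO | 462 | 846.71 | 0.55 |
| HOU | 519 | 761.14 | 0.68 |
| KCA | 398 | 887.66 | 0.45 |
| LAN | 482 | 792.69 | 0.61 |
| MIL | 530 | 869.11 | 0.61 |
| MIN | 369 | 599.29 | 0.62 |
| NYA | 441 | 767.8 | 0.57 |
| NYN | 418 | 635.49 | 0.66 |
| OAK | 464 | 776.04 | 0.6 |
| PHI | 463 | 880.38 | 0.53 |
| PIT | 438 | 852.79 | 0.51 |
| SDN | 440 | 708.83 | 0.62 |
| SEA | 408 | 664.32 | 0.61 |
| SFN | 443 | 823.21 | 0.54 |
| SLN | 481 | 974.54 | 0.49 |
| TBA | 463 | 796.54 | 0.58 |
| TEX | 438 | 550.34 | 0.8 |
| TOR | 507 | 857.35 | 0.59 |
| WAS | 477 | 899 | 0.53 |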

The values range from Atlanta at the bottom, where the humans identified only 43% as many pitches as the computers, to Texas, where the stringers called 80% as many balls in the dirt as Pitch FX did.  But that's not the really interesting piece of information to me.  Notice the discrepancy in the number of pitches that Pitch FX located as in the dirt.  Texas had only about 550, while St. Louis was almost at 1000.

There are a lot of things that could cause such a large difference between parks.  The most obvious explanation is that some pitchers just tend to throw more balls in the dirt than others.  Perhaps the Cardinals' staff throws a lot more splitters than the Rangers' staff does.  If that were the case, we'd expect to see roughly the same number of balls in the dirt when a team was on the road as when it was at home.

So I looked at how many pitches in the dirt each team threw both home and away.  I then normalized the results to whichever split, home or away, had fewer total pitches thrown.  Finally, I calculated the single-season park effects following the steps on Baseball Reference.[3]

Let me share a quick example before the results.  Let's look at the Texas Rangers.  As the home team, they had 244 pitches flagged as in the dirt according to Pitch FX.  Overall at home, Pitch FX captured 11991 pitches and missed 369, for a capture rate of 97%.  Scaling up for the missed pitches gives an expected 251.52 balls in the dirt[1], so roughly 2 percent of Texas's home pitches were in the dirt.

On the road, Texas had 10911 pitches registered with Pitch FX, and 684 missed.  The raw number of balls in the dirt was 293, and the scaled number was 311, for just under 2.7%.
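The scaling for missed pitches can be sketched in a few lines of Python. This is only my reading of the step, assuming a missed pitch is as likely to be in the dirt as a captured one; the function name `scale_for_missed` is made up for illustration.

```python
def scale_for_missed(flagged, captured, missed):
    """Scale the flagged balls in the dirt up to the full pitch count.

    Assumes pitches missed by Pitch FX were in the dirt at the same
    rate as the pitches it captured.
    """
    if captured <= 0:
        raise ValueError("need at least one captured pitch")
    total = captured + missed
    return flagged * total / captured


# Texas, 2008, using the counts quoted in the text.
home = scale_for_missed(244, 11991, 369)
road = scale_for_missed(293, 10911, 684)

# Scaled counts, then the share of all pitches that were in the dirt.
print(round(home, 1), round(road, 1))
print(round(100 * home / (11991 + 369), 2), round(100 * road / (10911 + 684), 2))
```

The scaled counts land close to the 251.52 and 311 quoted above; small differences come down to rounding.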

Next, I normalized the results to the smaller number of pitches (in this case, those thrown as the away team), giving 311 balls in the dirt on the road and 236 at home.  Dividing the home number by the away number gives the initial park factor, in this case .759.  Finally, we apply the Other Parks Corrector, which accounts for the fact that the averages of all the other parks include the ratings of this park.  It is calculated as `n / (n - 1 + IPF)`, where n is the number of teams (30) and IPF is the initial park factor from the previous step.  Multiplying the initial park factor by the corrector gives the Rangers a one-year Balls in Dirt Park Factor of .765, by far the lowest in the majors.
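For anyone who wants to check the arithmetic, here is a rough Python sketch of the normalization and park factor steps. The function names are mine, and it assumes the final factor is the initial park factor multiplied by the Other Parks Corrector, which reproduces the Rangers' .765.

```python
def normalize(home_bid, home_pitches, away_bid, away_pitches):
    """Rescale both counts to whichever split had fewer total pitches."""
    base = min(home_pitches, away_pitches)
    return (home_bid * base / home_pitches,
            away_bid * base / away_pitches)


def other_parks_corrector(ipf, n_teams=30):
    """n / (n - 1 + IPF), as given in the text."""
    return n_teams / (n_teams - 1 + ipf)


def park_factor(h_nbid, v_nbid, n_teams=30):
    """Initial park factor (home / away) times the corrector."""
    ipf = h_nbid / v_nbid
    return ipf * other_parks_corrector(ipf, n_teams)


# Texas: scaled counts and total pitches (captured + missed) from the text.
h_nbid, v_nbid = normalize(251.52, 11991 + 369, 311, 10911 + 684)
print(round(h_nbid), round(v_nbid))
print(round(park_factor(round(h_nbid), round(v_nbid)), 3))
```

Rounding the normalized counts to whole pitches before dividing is what the worked example appears to do, so the sketch does the same.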

Here are the results for the entire league, and you can find my complete spreadsheet up on EditGrid.  vNBID is the Normalized Balls in Dirt as the visiting team, while hNBID is the Normalized Balls in Dirt at home.  PF is the park factor.

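| Team | vNBID | hNBID | PF |
| --- | --- | --- | --- |
| ANA | 464 | 508 | 1.091378 |
| ARI | 366 | 436 | 1.18371 |
| ATL | 403 | 404 | 1.002398 |
| BAL | 447 | 499 | 1.112019 |
| BOS | 322 | 270 | 0.843047 |
| CHA | 349 | 295 | 0.849654 |
| CHN | 342 | 333 | 0.974539 |
| CIN | 439 | 481 | 1.092189 |
| CLE | 340 | 307 | 0.905872 |
| COL | 388 | 377 | 0.972569 |
| DET | 364 | 451 | 1.229218 |
| FLO | 395 | 374 | 0.948516 |
| HOU | 336 | 378 | 1.120332 |
| KCA | 402 | 498 | 1.229023 |
| LAN | 455 | 429 | 0.944656 |
| MIL | 523 | 495 | 0.948155 |
| MIN | 284 | 273 | 0.96251 |
| NYA | 358 | 387 | 1.078095 |
| NYN | 258 | 264 | 1.022463 |
| OAK | 283 | 391 | 1.364271 |
| PHI | 453 | 470 | 1.036231 |
| PIT | 376 | 357 | 0.95107 |
| SDN | 334 | 329 | 0.985522 |
| SEA | 344 | 338 | 0.98313 |
| SFN | 451 | 429 | 0.952769 |
| SLN | 451 | 520 | 1.147143 |
| TBA | 352 | 371 | 1.052084 |
| TEX | 311 | 236 | 0.764992 |
| TOR | 402 | 488 | 1.205335 |
| WAS | 483 | 449 | 0.931793 |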

I'm not sure what would cause a park factor for balls in the dirt, or even if it's a true effect.  One season of data is nowhere near enough to go on, so I'd like to replicate the results with the more limited 2006 and 2007 data and see if there's a pattern here.  Remember though, these are pitches that would be identified as balls in the dirt by the cameras and computers, not by the humans scoring the game, which should eliminate one potential source of bias.

It's possible that this discrepancy is just a reflection of some other explainable difference.  Perhaps one team played many more blowouts at home than on the road, so there was less need to try to get batters to chase at home.  Or perhaps some outlier pitchers happened to pitch more often on the road, driving up those numbers.

What other factors could contribute to such an effect?  I'm sure there's plenty I missed, and I'd love to hear any ideas that are out there.

[1] In case anyone is interested, here's the formula Harry provided me.  Writing `t` for the time at which the pitch reaches the ground (the expression that repeats three times in his original one-line version):

`t = (-vz0 - sqrt(vz0 * vz0 - 2 * az * z0)) / az`

`y = y0 + vy0 * t + 0.5 * ay * t * t`

He tells me that's where the ball should hit the ground in relation to the front of home plate, and I believe him.
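Here is a small Python sketch of the same calculation, using the Pitch FX field names from the formula. The function names and the `cutoff` parameter are mine, and the sign convention (negative values are behind the front of the plate, so the 3-foot rule becomes `y >= -3`) is my assumption based on the -3.5 and -4 foot cutoffs mentioned earlier.

```python
import math


def ground_y(y0, vy0, ay, z0, vz0, az):
    """Return the y position (feet) where the pitch reaches z = 0.

    Solves z0 + vz0*t + 0.5*az*t^2 = 0 for t using the same root as
    the formula above, then plugs t into the y equation.
    """
    disc = vz0 * vz0 - 2.0 * az * z0
    if az == 0 or disc < 0:
        raise ValueError("pitch never reaches the ground with these inputs")
    t = (-vz0 - math.sqrt(disc)) / az
    return y0 + vy0 * t + 0.5 * ay * t * t


def is_ball_in_dirt(y_ground, cutoff=-3.0):
    """Assumed rule: the pitch lands no farther back than the cutoff."""
    return y_ground >= cutoff


# Made-up inputs chosen so the arithmetic is easy to follow:
# the ball takes exactly 1 second to fall 16 feet.
y = ground_y(y0=50.0, vy0=-52.0, ay=0.0, z0=16.0, vz0=0.0, az=-32.0)
print(y, is_ball_in_dirt(y))
```

Changing `cutoff` to -3.5 or -4 is how you would test the looser definitions.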

[2] The Pitch FX numbers have decimals because not every pitch was captured by Pitch FX in 2008.  I assumed that a ball in the dirt was just as likely on a pitch that was missed by the computers, and scaled the number of balls in the dirt up to the total number of pitches.

[3] Although Baseball Reference describes an iterative process to get the proper park factors for batters and pitchers, I didn't think it applied in this case because I was looking at a single number rather than two correlated values.