Failing to Predict Walk Rates

I threw everything at the wall and not much stuck. Walks remain something of a mystery.

Doug Pensinger

The other day I wrote about predicting K% for pitchers (you’re welcome, Glen Perkins). We managed to use a two-input model to predict year-over-year K% with greater accuracy than K% itself, with an R-squared of .75. This was exciting, and studies using other factors have been able to inch that up to .76 or .77, based on some initial discussion.

I’ll argue in a future meta piece that given the closeness of the models, simpler is better (mirroring the FIP-SIERA-pFIP kind of debate). With that said, something tells me that determining the factors that can predict BB% won’t be quite so easy.


As has been discussed, Three True Outcomes are on the rise, and strikeouts and walks alone have accounted for more than a quarter of all plate appearances since 2006. With these no-ball-in-play outcomes becoming more significant, as well as studies showing BB and K alone can make DIPS simpler, it behooves us all to find out exactly what goes into these outcomes.

The other day we tried to predict strikeout rates for pitchers. Today, we’ll try to predict walk rates for pitchers. Eventually, if these are significant enough, we’ll try to simulate pFIP or K-BB ERA using our expected models. We’ll also dive into the hitter side. But for today, we get to the core of Jonathan Sanchez.


Using FanGraphs’ custom leaderboards, I ran regressions for pitcher seasons from 2006 to 2012 where pitchers had at least 350 batters faced (a better cut-off criterion than innings pitched). I started with 2006, as that’s sort of the dawning of the "modern era" of Three True Outcomes – before then, there were seasons that touched 25 percent of all plate appearances, but since then it has remained above that level. We’ve gotta cut it off somewhere! This gives us 1189 pitcher seasons to examine.

I compared unintentional walk rate ((BB-IBB)/PA) to a handful of potential indicators: Whiff%, overall pitches in the zone (Zone%), overall swing rate (Swing%), swing rate on pitches outside the zone (O-Swing%), first strike percentage (F-Strike%), fastball frequency (FA%, based on PITCHf/x), fastball velocity (vFA, also based on PITCHf/x), and strikeout percentage (just because).
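To make the sample filter and the walk-rate definition concrete, here is a minimal Python sketch. The field names (`tbf`, `bb`, `ibb`) and the two sample rows are made up for illustration; they are not the FanGraphs export format or real pitcher lines.

```python
from dataclasses import dataclass

MIN_BATTERS_FACED = 350  # cut-off used in this study


@dataclass
class PitcherSeason:
    name: str   # hypothetical field names, not FanGraphs' column names
    year: int
    tbf: int    # total batters faced (plate appearances against)
    bb: int     # all walks
    ibb: int    # intentional walks


def unintentional_walk_rate(season: PitcherSeason) -> float:
    """(BB - IBB) / PA, as defined in the text."""
    return (season.bb - season.ibb) / season.tbf


def qualified(seasons):
    """Keep only seasons with at least MIN_BATTERS_FACED batters faced."""
    return [s for s in seasons if s.tbf >= MIN_BATTERS_FACED]


# Made-up rows, purely to show the arithmetic.
sample = [
    PitcherSeason("Pitcher A", 2010, 800, 70, 6),
    PitcherSeason("Pitcher B", 2010, 300, 40, 2),  # dropped: under 350 BF
]
for s in qualified(sample):
    print(s.name, round(unintentional_walk_rate(s), 3))
```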


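| Input(s) | R-squared | Std. error |
| --- | --- | --- |
| F-Strike% | 0.418 | |
| Swing% | 0.300 | |
| O-Swing% | 0.147 | |
| Zone% | 0.061 | |
| Whiff% | 0.046 | |
| FA% (pfx) | 0.044 | |
| vFA (pfx) | 0.039 | |
| ALL = xBB% | 0.604 | 0.013 |
| FS% & Sw% | 0.482 | 0.015 |
| FS% Sw% Z% | 0.484 | 0.015 |
| FS% Sw% OS% Z% | 0.519 | 0.014 |
| All but FBs = xxBB% | 0.602 | 0.014 |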

So, this is gonna suck. Right away we see there’s no one factor that stands out like Whiff% or Swinging Strike Rate for strikeouts. In fact, no one input can explain even half of the variance in walk rates. First Strike Rate was the best predictor with an R-squared of .418, but that means it’s only explaining about 42% of the variance in walk rates. That’s appreciable, but it’s not enough to build a model on by itself.
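For a one-input regression, R-squared is just the squared correlation between the indicator and unintentional walk rate. A minimal sketch of that calculation, assuming nothing about the tooling actually used for the study; `r_squared` is a hypothetical helper and the numbers are invented, not the 2006-2012 sample.

```python
from math import fsum


def r_squared(x, y):
    """R-squared of a one-input linear regression (squared Pearson r)."""
    n = len(x)
    mean_x, mean_y = fsum(x) / n, fsum(y) / n
    sxy = fsum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = fsum((a - mean_x) ** 2 for a in x)
    syy = fsum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


# Made-up first-strike and walk rates for six pitcher seasons.
f_strike = [0.55, 0.58, 0.60, 0.62, 0.64, 0.66]
ubb_rate = [0.095, 0.090, 0.070, 0.080, 0.060, 0.065]
print(round(r_squared(f_strike, ubb_rate), 3))
```

An R-squared of .418 read this way means a straight line through F-Strike% accounts for about 42% of the spread in walk rates and leaves the rest unexplained.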

You’ll see in the bottom few rows that I was toying with different combinations of factors to improve the efficacy of a model. It wasn’t until we included five inputs that we reached an R-squared of .6, and it topped out at .604 if we threw absolutely everything at it. My goal was a simple model, like we had with strikeouts, so I was hoping the five-factor model would do as well as the seven-factor one when we looked at year-over-year correlations.
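The multi-input rows come from ordinary least-squares fits with several indicators at once. Below is a rough standard-library sketch of such a fit, reporting R-squared and the regression's standard error (assuming that is what the table's second number represents). The helper names `solve` and `fit_ols` are hypothetical and the data is invented.

```python
def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting.

    Raises ZeroDivisionError if the inputs are perfectly collinear.
    """
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(rows, y):
    """Least-squares fit with intercept; returns (coefs, r2, std_error)."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(X, y)) for i in range(k)]
    coefs = solve(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coefs, r)) for r in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((t - f) ** 2 for t, f in zip(y, fitted))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    r2 = 1 - ss_res / ss_tot
    std_error = (ss_res / (len(y) - k)) ** 0.5  # residual standard error
    return coefs, r2, std_error


# Made-up (F-Strike%, Swing%) pairs and walk rates for six pitcher seasons.
rows = [(0.55, 0.44), (0.58, 0.47), (0.60, 0.45),
        (0.62, 0.48), (0.64, 0.46), (0.66, 0.49)]
ubb_rate = [0.095, 0.090, 0.070, 0.080, 0.060, 0.065]
coefs, r2, se = fit_ols(rows, ubb_rate)
print(f"R-squared {r2:.3f}, standard error {se:.4f}")
```

On the same data, adding an input never lowers R-squared in a least-squares fit, so the year-over-year test is the fairer check of whether the extra inputs earn their keep.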

At this point, colleague Chris St. John pointed out to me that, while contact rate and zone-contact rate don’t correlate strongly with unintentional walk rate, they do add value in a multiple regression. By adding contact rate and zone-contact rate to "xxBB%" above, I was able to bump the R-squared to .62 without increasing the standard error.

As a Predictive Model

I used xBB% and xxBB% to see if they could beat BB% alone in predicting the next year’s walk rates. Our sample was now limited to 669 pitcher seasons, as this filters out all 2012 seasons (since we don’t have enough 2013 data yet) as well as any pitchers who failed to face 350 batters the following season.
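The pairing step described above could look something like this sketch: keep a Year 1 season only when the same pitcher also has a qualifying season the next year. The dictionary keys (`pitcher`, `year`, `tbf`) and the rows are hypothetical.

```python
def pair_consecutive_seasons(seasons, min_bf=350):
    """Return (year1, year2) season pairs for the same pitcher.

    Both seasons must meet the batters-faced cut-off, and year2 must be
    exactly one year after year1.
    """
    qualified = {
        (s["pitcher"], s["year"]): s for s in seasons if s["tbf"] >= min_bf
    }
    pairs = []
    for (pitcher, year), season in qualified.items():
        following = qualified.get((pitcher, year + 1))
        if following is not None:
            pairs.append((season, following))
    return pairs


# Made-up rows: A's 2012 falls short of 350 BF, so A 2011 has no partner.
seasons = [
    {"pitcher": "A", "year": 2010, "tbf": 800},
    {"pitcher": "A", "year": 2011, "tbf": 700},
    {"pitcher": "A", "year": 2012, "tbf": 200},
    {"pitcher": "B", "year": 2011, "tbf": 400},
    {"pitcher": "B", "year": 2012, "tbf": 500},
]
for y1, y2 in pair_consecutive_seasons(seasons):
    print(y1["pitcher"], y1["year"], "->", y2["year"])
```

Seasons from the last year in the data drop out as Year 1 rows on their own, since nothing follows them.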

| Metric | R-squared with Year 2 BB% | RMSE |
| --- | --- | --- |
| Year 1 BB% | 0.388 | 0.016 |
| Year 1 xBB% | 0.370 | 0.016 |
| Year 1 xxBB% | 0.328 | 0.017 |
| Year 1 F-Strike% | 0.234 | 0.018 |
| Year 1 5-factor | 0.329 | 0.017 |

And more bad news. Despite getting the descriptive fit of the model as high as an R-squared of 0.62, the model does a poor job of predicting year-over-year walk rate, falling short of the previous year’s walk rate itself. Here, the contact rates even take us backwards a step.
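RMSE is the other yardstick in the table: the typical size of the miss, in walk-rate units. A minimal sketch of the comparison follows, using invented numbers. It takes each Year 1 metric as a direct prediction of Year 2 BB%, which is an assumption; the figures above may instead come from regression residuals.

```python
from math import sqrt


def rmse(predicted, actual):
    """Root-mean-square error between two equal-length sequences."""
    if len(predicted) != len(actual):
        raise ValueError("sequences must be the same length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return sqrt(squared / len(actual))


# Made-up values for five pitchers, not the study's data.
year2_bb = [0.080, 0.065, 0.095, 0.070, 0.088]
candidates = {
    "Year 1 BB%": [0.085, 0.060, 0.100, 0.075, 0.080],
    "Year 1 xBB%": [0.078, 0.072, 0.085, 0.079, 0.081],
}
ranked = sorted(candidates.items(), key=lambda kv: rmse(kv[1], year2_bb))
for name, preds in ranked:
    print(f"{name}: RMSE {rmse(preds, year2_bb):.4f}")
```

An RMSE of 0.016 means a typical miss of about 1.6 percentage points of walk rate, which is why differences of 0.001 between models are hard to get excited about.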

Our most predictive method for determining future walk rates, then, appears to simply be…walk rate.

This is disappointing but not all that unexpected – we said at the outset that we thought walks would be more difficult to model.


I don’t think we’re done here yet, even though we couldn’t come up with a strong predictive model for walk rates and more than 60% of the year-to-year variance remains unexplained. Smarter people than me can probably suggest ways to improve this study, whether that’s a different methodology or a factor I missed that would be a good predictor of future walk rates.

Beyond that, we can also continue to work towards some proxy for simplifying pitcher performance, and the next step might be to try to predict K-BB%, which has been shown to be a strong predictor of pitcher performance. Perhaps walks alone haven’t given us any greater insight, but maybe there is something in the K-BB relationship that will present itself to us.

Thanks to Chris St. John and Bill Petti for the discussion during the writing of this piece.