It is never a good idea to bet on baseball, but I would like to propose a rule of thumb for predicting the outcome of a game. I aim to find a simple formula to calculate a win probability given the runs scored and runs allowed values for each team. The motivation for this came when I built my simulation to examine how long a season is required for the best team to finish with the best record. My simulation did not include a mechanism for teams to play against each other because I didn’t have a good way to have two teams interact given their respective runs scored and runs allowed talent levels. With this formula I hope to provide a means for running a quick-and-dirty calculation to arrive at a win probability to be used in simple simulations or for rough back-of-the-envelope calculations.
Methodology
For each season from 2004 to 2013 I calculated each team's runs scored and runs allowed in the first half of the season to use as predictors for the outcomes of games in the second half of the season. Choosing to use the first half of the season to predict the second half is a tradeoff I had to make. On one hand, full-year runs scored and runs allowed are going to be more stable (I presume, though this could be a future study). On the other hand, using run values from a set of games to predict the outcome of those same games gives rise to bigger problems, because the predictors would already contain the results they are supposed to predict.
Calculating win probabilities given a set of predictors is a problem that just screams logistic regression. Logistic regression is a method used to estimate the probability of a binary response. In this case the two options are win or lose. I trained a logistic regression model on the years 2004-2010 and then tested its accuracy on the years 2011-2013. As for choosing the predictors, I decided to use the run differential per game. This proved to be just as accurate as other methods, such as using Pythagorean records and run ratios, but also has the added bonus of easier interpretation. Also for easier interpretation, and with no apparent loss in accuracy, I boiled the regression equation down to just two terms: the difference between the two teams' run differentials and this value cubed.
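As a rough illustration of the two predictors described above, here is a minimal Python sketch. The function names, the tuple layout, and the first-half totals are all hypothetical and only for illustration; the home-minus-away orientation follows the model given later in the Results section.

```python
def run_diff_per_game(runs_scored, runs_allowed, games):
    """Run differential per game over a sample of games (e.g. a first half)."""
    return (runs_scored - runs_allowed) / games


def matchup_predictors(home, away):
    """Return (difference in run differential, that difference cubed).

    `home` and `away` are (runs_scored, runs_allowed, games) tuples.
    The difference is taken as home minus away.
    """
    diff = run_diff_per_game(*home) - run_diff_per_game(*away)
    return diff, diff ** 3


# Made-up first-half totals, not real team data.
home_team = (400, 350, 81)
away_team = (360, 380, 81)
diff, diff_cubed = matchup_predictors(home_team, away_team)
print(f"difference in run differential: {diff:.3f}, cubed: {diff_cubed:.3f}")
```

These two numbers per game are the only inputs the regression would need under this setup.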
Finally, to make interpretation easier, I fit a simple linear model to the probabilities computed by the logistic regression, using the same predictors. Since the predictors cover only a limited range of values, I am able to do this without loss of accuracy and without producing estimates below 0 or above 1.
Results
The model accurately predicted the game outcome 55.7% of the time, only slightly better than if I had just guessed the home team would win every time, which would have yielded an accuracy of 53.3%. Even though the model can’t predict the outcome of a single game very well at all, it proves to be fairly accurate on groups of games. Here is a table breaking down estimates by tiers of win expectancy. For example, in 56 games I assigned the home team a win probability between 40% and 45%. The average estimate among those 56 games was 44.7%, but in reality the home team won only 35.5% of those games. Since all of those games were predicted losses for the home team (because the estimate was below 50%) the model was 64.5% accurate on that tier.
| Tier | Number of Games | Mean Win Probability Estimate | Observed Winning Percentage | Accuracy |
| --- | --- | --- | --- | --- |
| 40% to 45% | 56 | 44.7% | 35.5% | 64.5% |
| 45% to 50% | 802 | 47.8% | 45.9% | 54.1% |
| 50% to 55% | 1148 | 52.6% | 50.6% | 50.6% |
| 55% to 60% | 1113 | 57.4% | 56.8% | 56.8% |
| 60% to 65% | 661 | 61.8% | 63.5% | 63.5% |

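The tier figures can be rolled back up into an overall accuracy as a sanity check. The sketch below copies the numbers from the table; the variable and function names are my own, and because the table values are rounded the result only approximately matches the 55.7% quoted earlier.

```python
# (games, mean estimate, observed home winning percentage), copied from the table.
tiers = [
    (56, 0.447, 0.355),
    (802, 0.478, 0.459),
    (1148, 0.526, 0.506),
    (1113, 0.574, 0.568),
    (661, 0.618, 0.635),
]


def tier_accuracy(mean_estimate, observed):
    """Share of games in a tier that the model called correctly.

    Below 50% the model picks the away team, so it is right
    whenever the home team loses.
    """
    return observed if mean_estimate >= 0.5 else 1.0 - observed


total_games = sum(games for games, _, _ in tiers)
overall_accuracy = (
    sum(games * tier_accuracy(est, obs) for games, est, obs in tiers) / total_games
)
print(f"{total_games} games, overall accuracy {overall_accuracy:.3f}")
```

The weighted average comes out near 55.6%, in line with the overall accuracy reported above once rounding in the table is taken into account.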
The above methodology has led me to arrive at the following linear model which predicts the probability of a home team win given the difference in the run differentials of the home and away teams.
E(Win Probability For Home Team) =
+ .5438
+ .0684*(Difference in Run Differential)
- .0049*(Difference in Run Differential)^3
To interpret the model I will first look to the intercept. The intercept says that given two evenly matched teams with identical run differentials, the model would predict the home team to win 54.38% of the time. For a one-run increase in the difference of run differentials, the win probability increases by roughly .0684, or 6.8 percentage points; starting from an even matchup, a full one-run step adds .0684 - .0049 = .0635 once the cubed term is included. This can be seen visually on this plot.
The cubed term becomes significant only when there are two very unevenly matched teams. For teams with differences in their run differential less than one, the model can estimate the home team’s win probability using just the linear function, since the cubed term changes the estimate only in the thousandths place. If we look graphically at just the window from -1 to 1, we can see the plot is nearly indistinguishable from a straight line.
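For readers who want to plug in numbers, the model can be written as a short Python function. This is a sketch using the coefficients given above with a negative cubed term; the function names are my own, and `diff` is assumed to be the home team's run differential per game minus the away team's.

```python
def home_win_probability(diff):
    """Estimated home win probability from the linear model above.

    `diff` is the home team's run differential per game minus the
    away team's.
    """
    return 0.5438 + 0.0684 * diff - 0.0049 * diff ** 3


def cubed_term(diff):
    """Contribution of the cubed term alone."""
    return -0.0049 * diff ** 3


# Inside the window from -1 to 1 the cubed term stays below half a point.
for d in (-1.0, -0.5, 0.0, 0.5, 1.0):
    p = home_win_probability(d)
    print(f"diff={d:+.1f}  p={p:.4f}  cubed term={cubed_term(d):+.4f}")
```

At a difference of exactly one run the cubed term shifts the estimate by .0049, and by far less for closer matchups, which is why the linear part alone works as a rule of thumb there.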
Of note is how skewed the odds are in favor of the home team. This is nothing new, but it is nice to confirm the already existing rule of thumb that the home team is expected to win around 54% of the time. Graphically we can see that given two evenly matched teams, the model arrives at an estimated win probability of 54%. The equation also means that we would call a game a 50/50 split if the home team has a run differential that is about .65 worse than the visiting squad.
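The break-even point mentioned above can be checked numerically. The sketch below uses plain bisection, which is my choice of method rather than anything from the original analysis; the function names and the search interval are assumptions.

```python
def home_win_probability(diff):
    """Linear model from the Results section (home minus away run differential)."""
    return 0.5438 + 0.0684 * diff - 0.0049 * diff ** 3


def break_even_diff(lo=-2.0, hi=0.0, tol=1e-9):
    """Find the diff where the home win probability equals 0.5 by bisection.

    Assumes the probability is increasing on [lo, hi], which holds for
    this model as long as |diff| stays below roughly 2.15.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if home_win_probability(mid) < 0.5:
            lo = mid  # still below 50%, so the root lies to the right
        else:
            hi = mid
    return (lo + hi) / 2


print(round(break_even_diff(), 2))  # -0.66
```

The result lands within a hundredth of a run of the figure quoted above, so a home team needs to be roughly two thirds of a run per game worse than its opponent before the game becomes a coin flip.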
Conclusions
As I mentioned earlier, predicting the outcome of a game is a fool’s errand. No matter the talents of the two teams, it is difficult to give a very confident prediction. However, we won’t let this stop us from trying. I set out here to establish a rule of thumb for estimating the probability of a win given two teams' runs scored and runs allowed. Since I fitted a cubed model, estimated probabilities are contained in a range from 44.5% to 64.0%. Given two equally matched teams, the home team has an estimated win probability of 54.4%. For each additional run per game in the difference between the two teams' run differentials, the home team's estimated win probability shifts by about 6.8 percentage points. We also learned that there are diminishing advantages at the extremes; a difference in run differential greater than 2 does not help that much. This gives us a simple way to model the interaction between two teams given their approximate runs scored and allowed talent levels. I hope to use this equation to revisit my simulation to examine whether or not this dramatically alters the results.
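To give a sense of how the equation could slot into a simulation, here is a minimal sketch. It is not the simulation referred to above: the function names, the seed, and the run differentials are hypothetical, and each game is treated as an independent coin flip weighted by the model.

```python
import random


def home_win_probability(diff):
    """Linear model from the Results section (home minus away run differential)."""
    return 0.5438 + 0.0684 * diff - 0.0049 * diff ** 3


def simulate_game(home_rd, away_rd, rng):
    """Return True if the home team wins one simulated game."""
    return rng.random() < home_win_probability(home_rd - away_rd)


def simulate_home_win_share(home_rd, away_rd, n_games, seed=0):
    """Fraction of n_games won by the home team, using a seeded generator."""
    rng = random.Random(seed)
    wins = sum(simulate_game(home_rd, away_rd, rng) for _ in range(n_games))
    return wins / n_games


# Two evenly matched teams: the share should hover near the 54.4% intercept.
print(simulate_home_win_share(0.0, 0.0, 20000))
```

Keeping the differences within about two runs per game in either direction seems prudent, since the cubed term makes the estimate turn back on itself beyond that range.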
. . .
The information used here was obtained free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at "www.retrosheet.org".
Daniel is a junior at Colby College and contributor to Beyond the Box Score and The Hardball Times. You can follow him on twitter @dtrain_meyer.