In this article, we present a predictive power ranking of national baseball teams based on the Elo rating system. This system objectively measures teams' relative strength based on past performance, strength of schedule, run differential, and event importance. To our knowledge, this system is unique: although the International Baseball Federation (IBAF) releases yearly rankings of national baseball teams based on performance in international competitions, the IBAF ranking includes results across multiple age levels and is not designed to predict future results.
Elo System Basics
Since its development by Arpad Elo in the 1950s to rank chess players, the Elo* ranking system has been adapted to a number of team sports, including Major League Baseball, international soccer, and NCAA football. The system consists of two main steps: using teams' existing rankings to predict an upcoming game, and then using the result of the game to update the teams' rankings.
* - In his article, Silver suggests that the name should be pronounced E-L-O, like "the band that has inspired so many hapless karaoke renditions over the past thirty years." But in his native Hungarian, Arpad Elo's last name is pronounced something like "AY-leu."
A good primer on the Elo ranking system can be found at this site. As a quick illustration, let's look at how our system would have handled the 2013 World Baseball Classic final between the Dominican Republic and Puerto Rico, played at AT&T Park in San Francisco. Before the game, the ranking difference between the two squads was 106 points. In addition, the Dominican Republic gets a small "home-field advantage" bonus, since they batted last in this game. We will show in a later section that home teams win about 52% of their games against teams of equal ability, and therefore adjust the ranking difference by 13 Elo points (about two percentage points) to produce the final pregame win expectancy.
|Team||Old Elo||HFA||Win Prob||Score||Delta||New Elo|
The Dominicans shut out Puerto Rico 3-0 to win their first championship. The two teams now exchange ranking points -- that is, Puerto Rico's ranking decreases by the exact amount the Dominican Republic's ranking increases. The amount the rankings change depends on the importance of the game, the margin of victory, and the expected result: had Puerto Rico pulled off the upset, the rankings would have changed by a much larger amount.
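The predict-then-update cycle can be sketched in a few lines of Python. This is a minimal illustration, not the article's actual implementation: it assumes the conventional 400-point logistic Elo scale (which is consistent with the 13-point, roughly 52% pairing quoted above), and the `k` factor and `multiplier` (standing in for game importance and margin of victory) are hypothetical placeholders. For illustration only, it also assumes the 106-point gap and the 13-point last-bat bonus favor the same team.

```python
def win_probability(rating_diff, hfa=0.0, scale=400.0):
    """Expected win probability for a team given its rating edge.

    rating_diff: the team's rating minus its opponent's rating.
    hfa: home-field adjustment in Elo points (positive if it favors the team).
    scale: 400 is the conventional Elo scale (an assumption here).
    """
    return 1.0 / (1.0 + 10.0 ** (-(rating_diff + hfa) / scale))


def exchange_points(p_win, won, k=20.0, multiplier=1.0):
    """Points the team gains; its opponent loses exactly the same amount.

    k and multiplier are hypothetical placeholders for the article's
    game-importance and margin-of-victory weights.
    """
    outcome = 1.0 if won else 0.0
    return k * multiplier * (outcome - p_win)


# Illustration: a 106-point edge plus the 13-point last-bat bonus.
p = win_probability(106, hfa=13)                # about 0.665
favorite_gain = exchange_points(p, won=True)    # about 6.7 points
favorite_loss = exchange_points(p, won=False)   # about -13.3 points
print(round(p, 3), round(favorite_gain, 1), round(favorite_loss, 1))
```

Because the update is proportional to the gap between the actual and expected result, an upset moves roughly twice as many points in this illustration as the expected outcome does.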
Creating an International Baseball Ranking
This system requires a large amount of information about previous games between as many teams as possible. Since a suitable publicly available database could not be found, we built our own, containing basic information on 6,700 games between 104 senior-level teams dating back to 1912. For each game, we tried to collect the final score, the location, the home team, and whether the game went to extra innings, although not all of this information is available for every game. For now, the database is hosted here.
Before we could use the database to calculate teams' rankings, we first had to decide how much to adjust the ranking difference to account for home-field advantage. This is complicated by the fact that there are three types of home-field advantage in international baseball: the advantage enjoyed by the team batting last on a neutral field, the advantage a host team has when batting last, and the (presumably smaller) advantage a host team has when batting first.
Unfortunately, we can't just look at the winning percentage of all teams in each of these three scenarios: teams with better baseball facilities are more likely to host events, so we have to compare their actual winning percentage to an expected winning percentage. We calculate their expected winning percentage from the teams' respective Pythagorean records using the odds ratio method.
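As a rough sketch of that calculation, the Python below computes a Pythagorean record from runs scored and allowed, then combines two teams' records with the odds ratio method. The exponent of 2 is the classic Pythagorean choice and is an assumption here, since the article does not state which exponent was used; the function names are illustrative.

```python
def pythagorean_record(runs_scored, runs_allowed, exponent=2.0):
    """Expected winning percentage from runs scored and allowed.

    exponent=2 is the classic formulation (an assumption; the article
    does not say which exponent it used).
    """
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)


def odds_ratio_expectation(p_a, p_b):
    """Probability that team A beats team B under the odds ratio method.

    p_a and p_b are each team's winning percentage (here, Pythagorean
    record) against average competition.
    """
    a_wins = p_a * (1.0 - p_b)
    b_wins = p_b * (1.0 - p_a)
    return a_wins / (a_wins + b_wins)


# Two hypothetical teams: one outscores opponents 120-100, the other 90-110.
team_a = pythagorean_record(120, 100)
team_b = pythagorean_record(90, 110)
print(round(odds_ratio_expectation(team_a, team_b), 3))
```

Comparing this expectation to the observed winning percentage of the team batting last is what isolates the home-field effect from the underlying difference in team quality.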
We first focused on games played on neutral fields, selecting a subset of 448 games dating back to 1992 played between a group of 20 teams. These teams were chosen to be about the same ability level -- all 20 had a Pythagorean record between .300 and .700 in this subset. Games with missing information (especially location or which team batted last) were not included in this subset.
The log of the odds ratio was used as the input to a logistic regression, which in turn produced an expected winning percentage for the "home" team on a neutral field. Because we did not have enough information to compile similar subsets for games away from neutral fields, we could not extend this method to the other types of home-field advantage. We therefore relied on Shane Tourtellotte's research to extend our findings. The following table shows our current ranking adjustments, along with the winning percentages we would expect for the team batting last in each scenario.
|Situation||Last-Bat Win %||HFA|
|Neutral site||51.8||13 pts|
|Host bats last||57.4||52 pts|
|Host bats first||46.3||-26 pts|
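The two columns of this table are linked by the Elo win-expectancy curve. The short Python check below converts each winning percentage into an Elo adjustment, assuming the conventional 400-point logistic scale (an assumption, although it reproduces the table's values after rounding). The function name is illustrative.

```python
import math


def elo_points_from_win_pct(win_pct, scale=400.0):
    """Elo point edge that corresponds to a winning percentage between
    otherwise equal teams (inverse of the Elo expectancy curve).

    scale=400 is the conventional Elo scale (assumed, not stated in the article).
    """
    p = win_pct / 100.0
    return scale * math.log10(p / (1.0 - p))


table = [("Neutral site", 51.8), ("Host bats last", 57.4), ("Host bats first", 46.3)]
for situation, pct in table:
    print(situation, round(elo_points_from_win_pct(pct)))
# Neutral site 13
# Host bats last 52
# Host bats first -26
```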
To establish our model's accuracy, we produced an expected winning percentage for each game in our database where both teams had already completed at least 25 games. We then used a calibration test to compare our predicted outcomes to the actual results. The calibration test tells us if the teams we expect to win, say, 70% of the time, actually win 70% of the time. The results of this test are presented in the figure below, along with a black line representing the equation expected WP = actual WP. This line is a good fit for our data; in fact, it has an R² value of .97.
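A calibration test of this kind can be sketched as follows: group the predictions into probability bins, then compare each bin's average predicted win probability to the share of games actually won. The ten equal-width bins and the function name are assumptions for illustration; the article does not describe how its bins were chosen.

```python
def calibration_table(predictions, outcomes, n_bins=10):
    """Return (mean predicted WP, actual WP, games) for each non-empty bin.

    predictions: forecast win probabilities in [0, 1].
    outcomes: 1 if the forecast team won, 0 otherwise.
    n_bins: number of equal-width probability bins (an assumed choice).
    """
    bins = [[0, 0.0, 0] for _ in range(n_bins)]  # games, sum of forecasts, wins
    for p, won in zip(predictions, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)   # p == 1.0 goes in the top bin
        bins[idx][0] += 1
        bins[idx][1] += p
        bins[idx][2] += won
    return [(total / n, wins / n, n) for n, total, wins in bins if n > 0]


# Toy data: four games forecast in the low 70s, three of them won.
for expected, actual, games in calibration_table([0.72, 0.74, 0.71, 0.73], [1, 1, 1, 0]):
    print(round(expected, 3), round(actual, 3), games)
```

A well-calibrated model produces points that sit close to the line expected WP = actual WP.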
For a more quantitative metric, we also calculated our model's Brier score. The Brier score is essentially the mean squared error of our predictions:

$$BS = \frac{1}{N}\sum_{t=1}^{N}\left(f_t - o_t\right)^2$$

Here, N is the number of predictions, f_t is the forecast probability, and o_t is the outcome as a binary variable (1 if the forecast event occurred, 0 otherwise). Nearly 3,000 games were used to validate our system; in these games, our model has a Brier score of 0.166.
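For readers who want to compute this metric themselves, a minimal Python version of the mean squared error described above might look like the following. The function name is illustrative.

```python
def brier_score(forecasts, outcomes):
    """Mean squared difference between forecast probabilities and results.

    forecasts: predicted win probabilities in [0, 1].
    outcomes: 1 if the forecast team won, 0 otherwise.
    """
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("forecasts and outcomes must be non-empty and equal length")
    squared_errors = [(f - o) ** 2 for f, o in zip(forecasts, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# A forecaster who always says 50% scores exactly 0.25.
print(brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 1]))  # 0.25
```

Lower is better: a perfect forecaster scores 0, and always guessing 50% scores 0.25, so a score of 0.166 is a clear improvement on a coin flip.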
In this section, we present the rankings produced by our model. Although we have 104 national federations represented in our database, Elo rankings typically take several games to stabilize. Any team that has not yet appeared in at least 25 games in our database is assigned a provisional ranking; while a team is in this stage, its results affect its own Elo score but not that of its opponents. For example, when Latvia lost to Finland in the recent European C-Level championships, Latvia's ranking decreased by the expected number of points, but Finland's did not change. This prevents teams from padding their ranking against inexperienced opponents.
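The provisional rule can be sketched as a small change to the usual point exchange. The Python below is an illustration under stated assumptions: the ratings and game counts are invented placeholders, the point swing `delta_a` is taken as already computed, and the handling of a game between two provisional teams (both ratings move) is a guess, since the article does not cover that case.

```python
PROVISIONAL_GAMES = 25  # threshold from the article


def apply_result(ratings, games_played, team_a, team_b, delta_a):
    """Apply an Elo point swing while respecting provisional status.

    delta_a: points team_a would gain in a normal exchange (negative for a loss).
    An established team's rating is frozen against a provisional opponent;
    a provisional team's own rating always moves. Treating two provisional
    teams as a normal exchange is an assumption, not from the article.
    """
    a_provisional = games_played[team_a] < PROVISIONAL_GAMES
    b_provisional = games_played[team_b] < PROVISIONAL_GAMES
    if a_provisional or not b_provisional:
        ratings[team_a] += delta_a
    if b_provisional or not a_provisional:
        ratings[team_b] -= delta_a
    games_played[team_a] += 1
    games_played[team_b] += 1


# Hypothetical numbers: a provisional team loses 8 points to an established one.
ratings = {"provisional_team": 1400.0, "established_team": 1450.0}
games = {"provisional_team": 10, "established_team": 40}
apply_result(ratings, games, "provisional_team", "established_team", delta_a=-8.0)
print(ratings)  # the provisional team drops to 1392.0; the established team stays at 1450.0
```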
This table presents the current Elo rankings as of August 12, 2014, for the 67 national federations who have graduated from provisional status. The provisional rankings are presented on a separate sheet.
The top of our rankings is naturally very similar to the top of the IBAF rankings. But the two lists diverge significantly further down. For example, Pakistan is currently ranked in IBAF's top 25, but because our system weights margin of victory and strength of schedule more heavily than the IBAF system, we rank Pakistan outside of our top 40. The importance of strength of schedule is also demonstrated by the team ranked immediately above Pakistan: the Bahamas has a much worse record and run differential, but Pakistan plays a significant number of games against weaker competition, such as its victories over Afghanistan, Sri Lanka, and Nepal by a combined 56-0 in the 2013 West Asia Baseball Cup.
As an example of our system in action, we present our model's predictions from the recently completed Haarlem Baseball Week, a biennial tournament held at Pim Mulier Stadium in the Netherlands. This year's tournament featured four national teams competing in a double round robin to determine seeding for a four-team, single-elimination tournament that determined the champion. To simulate the tournament, we created a 10,000-iteration Monte Carlo simulation using our model's win probabilities. This table presents our model's championship probabilities as predicted before the tournament, as well as the actual results.
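A simulation along these lines can be sketched in Python as shown below. Several details are assumptions made only to keep the example short: the ratings are placeholder values for four unnamed teams, home-field adjustments are ignored, seeding ties are broken at random, and the bracket pairs seed 1 with seed 4 and seed 2 with seed 3. The tournament's real tiebreakers and bracket may differ.

```python
import random
from itertools import combinations


def win_prob(rating_a, rating_b, scale=400.0):
    """Elo win expectancy for team A (conventional 400-point scale assumed)."""
    return 1.0 / (1.0 + 10.0 ** (-(rating_a - rating_b) / scale))


def play(a, b, ratings, rng):
    """Simulate one game and return the winner."""
    return a if rng.random() < win_prob(ratings[a], ratings[b]) else b


def simulate_once(ratings, rng):
    """One tournament: double round robin, then a four-team knockout."""
    teams = list(ratings)
    wins = {team: 0 for team in teams}
    for a, b in combinations(teams, 2):
        for _ in range(2):  # each pair meets twice
            wins[play(a, b, ratings, rng)] += 1
    # Seed by wins; ties are broken at random (an assumption).
    seeds = sorted(teams, key=lambda t: (wins[t], rng.random()), reverse=True)
    finalist_1 = play(seeds[0], seeds[3], ratings, rng)  # assumed 1 vs 4
    finalist_2 = play(seeds[1], seeds[2], ratings, rng)  # assumed 2 vs 3
    return play(finalist_1, finalist_2, ratings, rng)


def championship_odds(ratings, iterations=10_000, seed=0):
    """Share of simulated tournaments won by each team."""
    rng = random.Random(seed)
    titles = {team: 0 for team in ratings}
    for _ in range(iterations):
        titles[simulate_once(ratings, rng)] += 1
    return {team: count / iterations for team, count in titles.items()}


# Placeholder ratings, not the article's actual values.
print(championship_odds({"Team A": 1600, "Team B": 1550, "Team C": 1500, "Team D": 1450}))
```

Because every team reaches the knockout stage in this format, the round robin matters only through seeding, so the championship odds are driven mostly by the two elimination games.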
|Team||Elo||Pred. Champ.||Actual W-L||Place|
At this stage, we feel comfortable that our power ranking system can be used to predict both individual games and tournaments between senior-level national baseball teams. To demonstrate this, we have included predictions in this article for the recently completed Haarlem Baseball Week. Simulations of additional events (including the upcoming European Baseball Championships, Asian Games, and Central American and Caribbean Games) will be released in the coming weeks.
In addition to including future games, we will continue to look for additional historical games to add to our database. This will allow us to fine-tune our model (especially elements like the home-field advantage and the initial rankings). With enough historical games, we could even publish historical rankings to show how teams have improved over time. We are currently corresponding with national baseball federations and governing bodies directly to see if they have any additional information in their archives.
. . .
The author is extremely grateful for the many resources used to develop the extensive results database, including Peter Bjarkman's encyclopedia of international baseball, the Taiwan Baseball Hall Wiki, Mark Cruickshank at The Roon Ba, and representatives from those national baseball federations who responded to requests for information. He would also like to thank Neil Weinberg for his help condensing this article. Additional details about the methodology, including the #gorymath, can be found in this PDF.
Bryan Cole is a featured writer at Beyond the Box Score. He will be presenting this work at this weekend's Saberseminar, and is most looking forward to the Trackman radar demonstration. You can follow him on Twitter at @Doctor_Bryan.