Baseball is back. This weekend marks the beginning of play in NCAA's Division I*, which means it's time to take a quick dive into college baseball.
* - Other divisions started even earlier: Division III Bethesda Christian beat Occidental 8-6 on January 30 to kick off the college baseball season.
Last year, I introduced my open-source college baseball database (which I've recently updated), and showed a few example applications. I looked at win probabilities, how the new flatter seams helped increase offense, the stolen base breakeven point, and the value of bunting (honest).
But this time, I want to use someone else's data. Chris Long (now with the Detroit Tigers) has his own collection of useful college baseball tools on his GitHub. Let's use them to generate a season preview.
Predicting 2016 Records
The database includes game results and season win totals dating back to 1997. We can use these to build the simplest possible preseason predictor: a linear regression based on a team's winning percentage over the last few seasons. Rany Jazayerli famously did this with MLB data back in 2003 for Baseball Prospectus.
We replicated that analysis, taking only those seasons where a team was in Division I for the three previous seasons. That gave us 4,220 team seasons to work with, and the results were pretty similar to Rany's.
Y = .1153 + (.4396 * X1) + (.1944 * X2) + (.1329 * X3)
Here, Y is the predicted winning percentage this season, X1 is the winning percentage last season, X2 is the winning percentage two years ago, and X3 is the winning percentage three years ago. Comparing the predicted winning percentages to the actual winning percentages over our test database gives us promising results.
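To make the formula concrete, here is a minimal Python sketch that plugs three prior winning percentages into the regression above. The function name and the example team are made up for illustration; only the four coefficients come from the analysis.

```python
def predict_win_pct(last_year, two_years_ago, three_years_ago):
    """Apply the regression above to three prior winning percentages (0 to 1)."""
    return (0.1153
            + 0.4396 * last_year
            + 0.1944 * two_years_ago
            + 0.1329 * three_years_ago)


# Hypothetical team that went .600, .550, and .500 over the last three seasons.
projection = predict_win_pct(0.600, 0.550, 0.500)
print(f"Projected winning percentage: {projection:.3f}")
```

Note that the three slope coefficients sum to about .767, so the formula pulls every team toward the middle: a team that played .700 ball three years running projects to roughly .652, not .700.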
The best-fit line through these points is y = x, and the R² of this line is .459. That means that an increase of one percentage point in a team's predicted record corresponds to a one-point increase in its actual record. The R² tells us that just looking at the last three years' records explains almost half of the variation in year-to-year performance. This is slightly higher than in MLB (Rany's best-fit line had an R² of .402), but that is to be expected, since college baseball has less parity than its professional counterpart.
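For readers who want to check a fit like this themselves, the sketch below computes the R² of the best-fit line between predicted and actual winning percentages, which for a simple one-variable fit equals the squared Pearson correlation. The function name and the toy numbers are assumptions for illustration, not values from the database.

```python
def r_squared(predicted, actual):
    """R^2 of the best-fit line of actual on predicted (squared correlation)."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    # Sums of cross-products and squares around the means.
    cross = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    ss_p = sum((p - mean_p) ** 2 for p in predicted)
    ss_a = sum((a - mean_a) ** 2 for a in actual)
    return cross ** 2 / (ss_p * ss_a)


# Made-up predicted and actual winning percentages for five teams.
toy_predicted = [0.45, 0.50, 0.55, 0.60, 0.65]
toy_actual = [0.40, 0.56, 0.50, 0.66, 0.61]
print(f"R^2 = {r_squared(toy_predicted, toy_actual):.3f}")
```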
All this gives us an easy way to predict hundreds of team winning percentages. The Tableau data visual below gives predictions for all teams in D1, D2, and D3 with at least three seasons under their belt. The green dots correspond to their previous seasons' records, and the blue star to this year's projected record. You can use the search bar on the right to find your favorite team, and the check boxes to look up a specific conference.
Draft Prospect Similarity Scores
But maybe you don't have a team. Maybe you're into college baseball, but only for the MLB draft prospects. Luckily, there's an analytical tool for you too.
The Mahalanobis distance has the coolest name of all distance formulas, but it is also useful: it measures the distance between two vectors drawn from a distribution with a given covariance matrix. This covariance matrix isn't present in the typical distance formula you learned in geometry, but it is needed here because statistics like isolated power and walk rate aren't necessarily independent; that is, when one increases, the other usually does too. Getting deeper into the math would not make for very engaging reading, so I'm going to skip it.
But say you have a bunch of college hitters, and you want to find comparable players. You would take a few statistics you thought were important and compute the covariance matrix between these statistics. Then, take the stats of the two players you wanted to compare, plug them into the Mahalanobis distance formula, and you get a single number that tells you how similar the two players are.
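The steps just described can be sketched in a few lines of plain Python. This is only an illustration of the general technique, not Chris's actual code (which is written in R and PostgreSQL): the function names, the choice of two statistics, and the player stat lines are all made up, and there is no strength-of-schedule adjustment.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length stat vectors."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(k)] for i in range(k)]


def solve(matrix, vector):
    """Solve matrix * x = vector by Gaussian elimination with partial pivoting."""
    k = len(vector)
    aug = [row[:] + [v] for row, v in zip(matrix, vector)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, k):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, k + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, k))
        x[i] = (aug[i][k] - tail) / aug[i][i]
    return x


def mahalanobis(u, v, cov):
    """Distance sqrt((u - v)^T * cov^-1 * (u - v)) between two stat vectors."""
    diff = [a - b for a, b in zip(u, v)]
    weighted = solve(cov, diff)  # cov^-1 * diff, without forming the inverse
    return math.sqrt(sum(d * w for d, w in zip(diff, weighted)))


# Made-up (walk rate, isolated power) lines for a small pool of hitters.
hitters = [(0.08, 0.120), (0.12, 0.210), (0.10, 0.150), (0.15, 0.240), (0.06, 0.100)]
cov = covariance_matrix(hitters)
print(mahalanobis(hitters[0], hitters[4], cov))  # two similar hitters
print(mahalanobis(hitters[0], hitters[3], cov))  # two dissimilar hitters
```

With an identity covariance matrix this reduces to the ordinary straight-line distance from geometry class; the covariance matrix is what rescales and de-correlates the statistics so that none of them dominates just because of its units or its overlap with another stat.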
Chris's code uses walk rate, strikeout rate, isolated power, and BABIP as features for both hitters and pitchers, and adds innings pitched to pitchers. An adjustment is made for strength of schedule, and all statistics are given equal weight. For each player season between 2002 and 2015, the code returns the 20 most similar seasons.
I looked up the top 10 prospects on Baseball America's preseason top 100 draft prospect list*. For each prospect, the table below lists the MLB player in our database whose season best matched the prospect's 2015 season, the Mahalanobis distance between the two seasons (lower numbers mean more similar), and the prospect's rank on D1Baseball's top 300 preseason draft prospect list.
Name | Pos | College | BA Rank | D1Baseball Rank | Closest Comp | Distance |
---|---|---|---|---|---|---|
A.J. Puk | LHP | Florida | 1 | 1 | David Price '05 | .0671 |
Alec Hansen | RHP | Oklahoma | 2 | 2 | David Purcey '02 | .0466 |
Corey Ray | OF | Louisville | 3 | 6 | Ike Davis '06 | .1911 |
Buddy Reed | OF | Florida | 4 | 3 | Steven Tolleson '04 | .0902 |
Nick Senzel | 3B | Tennessee | 5 | 7 | Andy Parrino '06 | .1255 |
Kyle Funkhouser | RHP | Louisville | 6 | 12 | Elih Villanueva '08 | .0576 |
Connor Jones | RHP | Virginia | 7 | 4 | Michael Stutes '07 | .0037 |
Kyle Lewis | OF | Mercer | 8 | 5 | Chris Dominguez '08 | .3330 |
* - The ninth and tenth prospects, Oregon's Matt Krook and Georgia's Robert Tyler, spent 2015 recovering from Tommy John surgery and forearm strain, respectively.
Conclusion
You might think this analysis is pretty basic, and you'd be right. What I've shown is just the beginning of the kinds of things you can do with these data. But the good news is that both the data and the tools to use them are freely available: Chris's similarity score code requires R and PostgreSQL. If you have the time and the #want, it's a wide open field for sabermetric exploration.
. . .
All statistics courtesy of NCAA. These statistics are freely available in MySQL-compatible format through Bryan's GitHub page. Special thanks to Christopher D. Long and Meredith Wills for their code.
Bryan Cole is a featured writer for Beyond the Box Score and still a Tulane Green Wave fan, which explains the dark green and light blue color combination. You can follow him on Twitter at @Doctor_Bryan.