Baseball Databank Data Dump, I: FIP Constants and Runs Per Win
Hi. I'm Matt. I write here (occasionally) now.
Perhaps I could have picked something a bit more exciting for my first post, but I kept missing stuff, getting busy with other things, and so on. But I think this might be of utility to some people -- I know there were times I've wished I could easily access stuff like this.
Like many people, I originally got into SQL databases for baseball thanks to a few posts by Colin Wyers at the sadly-departed Statistically Speaking blog (Colin has since posted new versions of his introductions at The Hardball Times.)
This isn't about SQL, though. It's about putting some data out there for people who want to use FIP scaled to a particular season or need a runs-to-win conversion but don't want to mess with the database stuff themselves. I'll also publish the queries down below for those curious (they are very simple), which I came up with for use in MySQl and the pitching table in the freely available Baseball Databank.
FIP Scaled Historically
You probably already know about FIP (Fielding Indepedent Pitching), created by Tom Tango. (If not, click here for a good introduction), the simple and beautiful stat that gives us an idea of how well a pitcher performed independently of batted ball luck and the fielders behind him by focusing on things he can control. The basic formula is:
((13*HR)+(3*(BB+HBP-IBB))-(2*K))/IP+constant
Where the "constant" is usually around 3.2, but is adjusted for each season/entity to that the leagues season/entity ERA is the same as the season/entity FIP. Obviously, it's a pain to calculate each season individually, but that's why computers were invented. So I came up with an easy query for that. Not only that, but if you're like me and many others, you hate the "earned run" rule for various reasons I can't get into here. Not only that, but you realize that when, say, calculating Wins Above Replacement, using a stat scaled to ERA undervalues a pitcher relative to position players. So, instead of scaling FIP to ERA, why not to RA? Well, I did both with this query. And here are the results for easy bookmarking:
Historical FIP Scales, 1871-2009
The numbers in the "FIP-ERA" and "FIP-RA" columns are the "constants" as listed above that you need to use to scale them to the ERA/RA for the season. I have two separate sheets, one that groups all the major leagues by season, and the other that groups by season and league (i.e., separates AL and NL).
Runs Per Win
You can read the link in the heading if you aren't sure what this means. In the contemporary baseball era, you're pretty safe assuming that 10 runs = one win. But sometimes you want to be more precise, so I've come up with a table for that, too, comparing the results of three different runs per win estimators.
- Simple Runs Per Game (or lgRA times two).
- Tom Tango's simple formula: lgRA*1.5+3
- Patriot's PythagenPat version: (league Runs Per Game^(1-z)) times two, where z is between 0.27 and 0.29. I use 0.287 because that's what I read originally and I'm too lazy to figure out if I should change it.
Here is the table:
As noted in the first link in this section, the Pythagorean win expectation is the heart of the runs-to-win conversion, and since PythagenPat is the most accurate win expectation estimator across all contexts, so will its run to win conversion be the most accurate (although, again, you're pretty safe just going with 10 in a pinch). But I actually still use Tango's version in a lot of my SQL's because it's easier to work with quickly, and I'm not sure the loss in precision is worth all the effort, although I will eventually change all my SQLs over to PythagenPat. And, as you can see from the table, in the contemporary era there really isn't all that much of a difference, and even RPG isn't that far off.
I hope this is useful to someone out there, and if not, well, I hope I'm more interesting next time. I'll spare you the SQLs on the front page, but I do have them after the break for those that are interested.
These are for MySQL with the Baseball Databank. Both of these are the "group by league" versions, but if you take out the lgID in the "select" and "group" lines, it will group by all the major leagues of that year.
FIP Constants
SELECT p.yearID, p.lgid , SUM(COALESCE(IPOuts,0)/3) AS IP ,((SUM(COALESCE(ER,0))/SUM(COALESCE(IPOuts,0)/3))*9)AS lgERA , ,((SUM(COALESCE(ER,0))/SUM(COALESCE(IPOuts,0)/3))*9) - ((SUM(COALESCE(HR,0)*13)+(SUM(COALESCE(BB,0))+SUM(COALESCE(HBP,0))- SUM(COALESCE(IBB,0)))*3-SUM(COALESCE(SO,0))*2)/(SUM(COALESCE(IPOuts,0))/3)) AS FIPERScale , ,((SUM(COALESCE(R,0))/SUM(COALESCE(IPOuts,0)/3))*9) AS lgRA , ,((SUM(COALESCE(R,0))/SUM(COALESCE(IPOuts,0)/3))*9) - ((SUM(COALESCE(HR,0)*13)+(SUM(COALESCE(BB,0))+SUM(COALESCE(HBP,0))- SUM(COALESCE(IBB,0)))*3-SUM(COALESCE(SO,0))*2)/(SUM(COALESCE(IPOuts,0))/3)) AS FIPRAScale FROM bdb.pitching p GROUP BY p.yearID, p.lgid ORDER BY p.yearID DESC
Runs Per Win
SELECT yearID, lgID
, (((SUM(R)/SUM(IPOuts/27))*1.5)+3) AS RPWTango
, (SUM(R)/SUM(IPOuts/27))*2 AS RPG
, 2*(POW(SUM(R)/SUM(IPOuts/27)*2, 0.713)) PythPat
FROM bdb.pitching
GROUP BY yearID, lgID
ORDER BY yearID DESC
;
15 comments
|
2 recs |
Do you like this story?
Comments
Very cool stuff
I’m definitely going to find a bunch of ways to use this stuff.
I like baseball.
I write for Beyond the Box Score and The Hardball Times Fantasy
by Satchel Price on Apr 8, 2010 12:35 PM EDT via mobile reply actions
Thanks, I hope it's usuable
I hope I’m right in assuming that people can round for themselves?
I'm not a sabermetrician, but I do play one at FanGraphs.
Can't get enough of me? Check out my Twitter feed.
by Matt Klaassen on Apr 8, 2010 12:37 PM EDT up reply actions
Welcome Matt
Neat to see how rapidly and semi-permanently the FIP constant drops below 3 once you get into the mid-90’s. The classic 3.2 constant estimate works pretty well most of the time these days, but it’s sort of at the extreme high end of the variation.
How ’bout a plot of FIP constant vs. year? :)
-j
I write at:
Beyond the Boxscore | Red Reporter | Basement-Dwellers.com | Twitter: @jinazreds
Sounds easy enough.. I'll do it in the next week or so or whatever
do you want it scaled to ERA or RA?
ON a different topic, did you ever come up with an easy way of getting custom lwts scaled to run environment w/ BaseRuns that didn’t totally overweight every event?
I'm not a sabermetrician, but I do play one at FanGraphs.
Can't get enough of me? Check out my Twitter feed.
by Matt Klaassen on Apr 8, 2010 12:45 PM EDT up reply actions
And, yes, I've read Patriots articles
I’m just too dumb
I'm not a sabermetrician, but I do play one at FanGraphs.
Can't get enough of me? Check out my Twitter feed.
by Matt Klaassen on Apr 8, 2010 12:45 PM EDT up reply actions
Not really
The base runs equation that I published in the minor league run environment article at THT last month does a good job across most run environments in modern baseball (major and minor leagues). Misses by like 8 runs per season on average. I might have too much weight on the outs, or else I’m missing too much baserunning or something. But I forced the b-coefficients to match up to Colin’s empirical linear weights for 2007-2008 MLB rather than actual run totals. I like that approach better.
Oh, and I was just talking about the ERA constant…though it might be interesting to show how they diverge as the error rates pick up back in the no-gloves era.
Cheers,
Justin
I write at:
Beyond the Boxscore | Red Reporter | Basement-Dwellers.com | Twitter: @jinazreds
Actually, I was thinking of just posting the recent weights and discussing them
I’m not a SQL teacher, I just put them in here because I hadn’t seen t hem before and to “show my work.”
Might also do historical BaseRUns-FIP at some point
I'm not a sabermetrician, but I do play one at FanGraphs.
Can't get enough of me? Check out my Twitter feed.
by Matt Klaassen on Apr 8, 2010 3:14 PM EDT up reply actions
That is great Colin
I tried it around when BDB came out. But the wOBA numbers I got were about 5-15 points lower than Fan Graphs which derives their numbers from the same data set. I think it might have been my wOBA equation.
rzar.wordpress.com
draysbay.com
raysprospects.com
by RZ on Apr 8, 2010 4:09 PM EDT up reply actions
Pythpat RPW
The Tango linear formula is actually a cousin to the more involved Pythagenpat formula. It’s basically what you get when you linearize the Pythagenpat formula for a R=RA team at 9 RPG. (Tango developed it independently of Pythpat, FWIW). So the two are very much related and for normal enviornments, you lose very little by using the simple formula.
thanks
interesting
I'm not a sabermetrician, but I do play one at FanGraphs.
Can't get enough of me? Check out my Twitter feed.
by Matt Klaassen on Apr 9, 2010 2:13 PM EDT up reply actions



















