(Everything in bold is from me and everything in quotes is from Colin. You can find his work at Baseball Prospectus here and find him on Twitter here.)
Your title states that you are the Director of Statistical Operations for Baseball Prospectus. What does that mean exactly and what all does it entail?
Well, what it means is that I'm responsible for the statistics that appear on our website. We have a server dedicated to producing our stat updates (both the nightly updates of stuff that appears on the cards, sortables and reports like the Playoff Odds, and the bigger offseason stuff like PECOTA), and I'm responsible for developing and implementing the stuff that goes into all of that. This gets split into doing research on things we could be doing, implementing it and maintaining it. We also do a lot of one-off projects for our authors, and I help out with that from time to time, although the bulk of that is done by Tim Collins and Ryan Lind. I end up writing a lot of SQL code, although we use some other scripting languages as well (primarily shell scripts and Python).
So naturally you have to be very handy with coding but what about the sabermetric side of it; is the coding more complex than either learning or staying up to date with what's happening within the field of sabermetrics, or is it the other way around?
You're always learning new things. There's really three parts to it -- one is coding, one is statistics and one is baseball. So you want to read up on sabermetrics, definitely. And not just in sabermetrics, but what people are writing about baseball in general. And not just the current stuff -- current stuff is good, but you want to have a bit of historic reading done too. Patriot has a lot of good stuff on his site, for instance.
But you certainly want to be reading non-sports statistics stuff now and again, as well as stuff on computer programming. They're your tools, after all, and you're going to work quicker and better if you understand your tools. I see a lot of people who talk about how all they know is Excel, and you're really limiting yourself if that's the heaviest tool in your skillset. It's not going to serve you well for big datasets, and it makes reusing things a real pain. I do a lot of research in straight SQL, I'll do a lot of things in a software package called gretl (which is free to download and really easy to use), I'll dip into GNU R or Python for some heavier number crunching. And, y'know, better tools won't make a bad analyst good -- you absolutely need to start off with a firm grasp of baseball, statistics and data collection. But better tools empower a good analyst a lot more. So I'd say it's more important and more difficult to learn the analysis and sabermetrics portions of it.
For those that want to become a better analyst, what would your recommendation be for where they should start? For example, perhaps they are very interested in sabermetrics but just aren't all that up to speed on everything that has been developed or continues to be expounded upon.
What's the foundation for any good analyst, in your opinion?
The first thing you want to do is get your hands on some data. You can get the Lahman DB, you can go to DougStats and copy pasta it into a spreadsheet, I don't really care. Just, get yourself some data.
The next thing you want to do is read. Go to Patriot's old Tripod site, or his blog for that matter. There's a ton of old stuff at Baseball Prospectus (and the older stuff is all outside of the paywall) and The Hardball Times for you to read. Go find anything on those sites written by Woolner, Click, Silver, Davenport, Fox, Gassko, Studeman... read it. Pick up a copy of Baseball Between The Numbers. Get a copy of The Book, which is back in print. Here's my recommended two-week reading plan for getting the most out of The Book:
1) The Toolshed.
2) The Appendix.
3) The Toolshed.
4) The Appendix.
5) The Appendix.
6) The Appendix.
7) The Appendix.
8) The Appendix.
9) The Appendix.
10) The Appendix.
11) The Appendix.
12) The Appendix.
13) The Appendix.
14) The Appendix.
Now, I am not guaranteeing at this point you will understand everything that's in the appendix. The appendix is hard. Keep trying until you get it, it's worth it. (Don't actually read it every night for over a week, give yourself breaks from it.)
And while you're doing all this reading, you should be following along. Remember that baseball data I told you to get? When you're doing your reading, you should be looking for things you can duplicate using the data you have. So if you read an interesting study on, say, Pythagenpat, pull up a spreadsheet or whatever and actually put in the formula yourself using real data and see what it says. Use different versions of BaseRuns and Runs Created. Get your hands dirty and play. If you come across a study you like, try and duplicate it yourself and compare your results to what the author got.
And don't necessarily worry about getting current right away. A lot of the old stuff is a lot simpler to use and understand, and still gives pretty good results. What you want to do is break it all apart so you can figure out how it works, not just use the latest and greatest stuff because it's what's now.
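For anyone who would rather try the Pythagenpat exercise in code than in a spreadsheet, here is a minimal Python sketch. It assumes the commonly cited form of the estimator, where the exponent is runs per game (both teams combined) raised to a power of about 0.287. That constant, the function names and the sample team totals are illustrative assumptions, not something taken from the interview; swap in real season totals from the Lahman DB or a similar source.

```python
def pythagenpat_exponent(runs_scored, runs_allowed, games, power=0.287):
    """Exponent x = (total runs per game) ** power.

    The default power of 0.287 is a commonly quoted value; other
    published versions use values close to it.
    """
    runs_per_game = (runs_scored + runs_allowed) / games
    return runs_per_game ** power


def pythagenpat_win_pct(runs_scored, runs_allowed, games, power=0.287):
    """Expected winning percentage: R**x / (R**x + RA**x)."""
    x = pythagenpat_exponent(runs_scored, runs_allowed, games, power)
    scored = runs_scored ** x
    allowed = runs_allowed ** x
    return scored / (scored + allowed)


if __name__ == "__main__":
    # Hypothetical season totals, made up for illustration only.
    team = {"R": 800, "RA": 700, "G": 162}
    win_pct = pythagenpat_win_pct(team["R"], team["RA"], team["G"])
    print(f"Expected W%: {win_pct:.3f}")
    print(f"Expected wins: {win_pct * team['G']:.1f}")
```

Running this against a full league-season and comparing the expected wins to the actual standings is one way to "break it apart" and see where the estimate holds up and where it misses.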
What about the realm of public research? You see many websites or companies getting into the world of sabermetrics and data analysis but keeping it to themselves unless you pay for it. Do you feel that the era of public sabermetric research and free data are over?
I think there's always going to be some free data in the world. People like Sean Lahman and all the wonderful, wonderful people behind Retrosheet are going to make sure of that. MLB Advanced Media has been very generous with the public on things like Pitch F/X and their play-by-play data and so on. In terms of free access to data and the tools to do something with it, we're really in a golden age. Thirty years ago, teams didn't have access to the data and tools that every baseball fan can download for free. And some of that is just the massive advancements in computers we've had, but a lot of that is the great work of a lot of volunteers that sabermetrics owes an incalculable debt to.
And I think there's still a place for the "freemium" model, where some stuff is public and some stuff is paid for. At Prospectus, all of our TAv and FRAA and WARP is free to everybody. Stuff like PECOTA and the fantasy tools we've built around it like the Player Forecast Manager are paid for. Over at Baseball Reference, there's a lot of free stuff, and then there's Play Index which costs money. And I think there's room for everybody to benefit in that kind of an arrangement, I really do -- the sites get the money to keep the servers running and to pay people to keep doing the work, paying customers get the most access but everybody gets access to a lot more things than if you didn't have paying customers at all.
That said, you do see people talking about "Well, we can't go anywhere more with fielding until we get Field F/X," and I find that discouraging, because I think Pitch F/X was a case where nobody really knew what we had until the genie was out of the bottle, and while MLB has been very gracious about continuing to give everybody access to Pitch F/X, I don't think that happens with the next big steps forward in data collection. And at the same time, teams are poaching a lot of our best talent. So, y'know, I wouldn't say public research is over, but it could be that we're seeing the sun setting on the time when teams were in some ways trying to catch up to where the amateurs were on quantitative analysis. We still have some advantages as public analysts -- public analysis means collaboration and peer review and bouncing ideas off each other. And it means getting to take advantage of unique skill sets -- there aren't thirty nuclear physicists out there studying baseball such that each team can have one, but someone like Alan Nathan can contribute a lot to the discussion and everyone else gets to build on that. But teams certainly have the advantage in data collection now.
Regarding defensive metrics, many have stated that we won't really advance any further in that area without Field F/X data being made public and free. Do you feel that no further advancements can be made with defensive metrics and how we evaluate defensive performance without Field F/X being made publicly available? Are there any areas in which you feel that defensive metrics can be improved without it?
I think we've seen the pendulum swing back around a bit towards defensive metrics based on publicly accessible data. People like Sean Smith, Peter Jensen, Michael Humphreys and myself have all worked on building up those sorts of metrics, and I think they stand up pretty well as a group. But I don't think any of us have created the best possible metric using publicly accessible data, so I certainly think there's room to improve. And there's new fields of study, like catcher framing, that are certainly still vibrant and growing.
What kind of work have you done in regards to building up defensive metrics and furthering that research? Is there any particular aspect of defense that you're most interested in or feel there is more opportunity to make contributions in?
Well, I've built all the defensive stats currently in use at BP -- our current FRAA is fully based on play-by-play data, but it's a very different approach than most other defensive metrics (except for Michael Humphreys' DRA, perhaps, and it shares a few features with Defensive Win Shares of all things). But I think we can continue to tune and improve FRAA in the future as well. We also have arm metrics and such.
I think there's work to be done on the relationship between pitcher and catcher past what's already been done in pitch framing, in terms of stuff like pitch sequencing and in terms of the running game. And that's all ground that's still pretty fresh, as far as I can tell.
It's interesting that you bring up pitch sequencing because I spoke with Matt Swartz back in March and he presented game theory as it pertains to baseball at the SABR Conference this year and I came away with the impression that pitch sequencing would fall within the realm of game theory.
Have you done any research into game theory as it pertains to pitch sequencing, or in general? What are your thoughts about game theory and how it could relate to baseball strategy?
Game theory is one approach to studying the question, in terms of figuring out what decision to make, but there are other questions, like how the responsibility splits between catchers and pitchers, that probably yield to other analytic approaches.
Where does that responsibility fall, in your opinion? Should the catcher have more say in pitch sequencing, the pitcher, or should those calls come from the dugout? Furthermore, how would we quantify the ability to call such a game?
I wish I knew the answer to that question. There's things you can look at in terms of matched pairs or With Or Without You that can maybe hint at those things, though.
What areas of baseball and sabermetrics do you think need a lot more study and research, or would greatly benefit from it?
I think we're going to see a lot more work in pitcher-fielder interaction, fielder-fielder interaction and pitcher-catcher interaction.
When you say pitcher-fielder interaction, are you referring to the pitch-to-contact style of pitchers or something else entirely?
I mean a refinement of DIPS. Before DIPS, you had the idea that pitchers were 100% responsible for what took place while they were on the mound, which was clearly wrong. Then you have DIPS, which says "There is little if any difference among major-league pitchers in their ability to prevent hits on balls hit in the field of play." That's a big step forward, but there are SOME differences among major-league pitchers. So we've spent the past decade or so trying to figure out if DIPS is true, and what we've come back with is "mostly." But I think there's still a lot of ground that can be covered there.