Gulp. This is a tough game. No, I am not talking about baseball, but sabermetrics! While on the subject, I agree 100% with Bill James that sabermetrics is an unfortunate word to use to describe the statistical analysis of baseball. For me it conjures images of socially inept math professors, completely detached from reality, pontificating from their ivory tower. Anyway, this article is not interested in a debate about the misfortune of the term sabermetrics; its purpose is to bemoan some of the frustrations that come with our chosen hobby.
Only for the lucky or talented few is the study of baseball statistics a full-time profession. For the rest of us it is something that we dedicate our spare time to. This means that a wannabe sabermetrician not only has to be a baseball and math expert, but also a database guru, programming junkie and spreadsheet geek. Only then can we begin to lift the fog in which baseball statistics generally wallow. So how can a budding baseball enthusiast navigate this treacherous course? One option is to do as I did: log on to Amazon.com and purchase a copy of a new book called Baseball Hacks by all-round baseball stat super sleuth Joseph Adler.
Baseball Hacks is a 400-page text dedicated to helping Joe Average get started in the world of hardcore baseball analysis. Want to know how to build a 30-year play-by-play database, or how to work out win expectancy? Then this is the book for you. You may need a degree in Computer Science to understand exactly what you are doing, but the chances are that you'll find a nugget or two in this weighty tome.
The book is divided into seven chapters, and each chapter is organized into a series of hacks, each focusing on a particular topic; there are seventy-odd hacks in all. The first couple of chapters introduce the nuts and bolts of baseball analysis, from the mundanely simple (how to read a box score) to the fiendishly complex (building a 30-year play-by-play database using Perl and SQL). Later chapters discuss interpreting and analyzing data: one chapter is dedicated to graphical presentation, while another shows how to calculate all kinds of exotic sabermetric statistics. In short, if you were to read and understand the book from cover to cover you could drop Marc Normandin a quick email and probably get a writing job on this blog, perhaps replacing me!
The authors (although Joseph Adler is the named author, it turns out that the hacks are drawn from a reasonably wide collection of different analysts) sensibly take advantage of the plethora of free tools available on the Internet, so much of the early chapters serves as an introduction to the software packages and programming languages required to sniff out and process data and analyze statistics. And herein the challenges for the reader, and the authors for that matter, begin. Introducing a package like MySQL in one 400-page book would be an accomplishment, but attempting to condense it into 3 pages, and then expecting the reader to be fluent, makes successfully implementing a particular hack on the first attempt about as likely as hooking up with the hottest girl at the school prom!
After attempting a few hacks one begins to wonder whether the authors are perhaps guilty of overcomplication in order to show off their programming credentials, which are admittedly impressive. There is little doubt that this book has been written by techies for techies. Take the hack for the play-by-play (PBP) database which, it turns out, only works in UNIX (it took me half a day to discover this). Whoa: flashing lights and sirens. Please excuse the rant that follows ... but why do books like this assume everyone runs UNIX? Sure, if you're operating business-critical applications then UNIX is great, but this book is aimed at you and me, enthusiasts, who aren't going to have an Oracle database next to the fridge. Yes, we use Windows. So for this book to be truly accessible it should be written for Windows users. Something to focus on for the second edition, methinks.
One amusing thing that caught my eye as I flicked through the pages was the number of shameless endorsements for other books by the same publisher (O'Reilly). After each plug the authors bashfully add that they were not asked to push sister books but were doing so because they are so great! Maybe so, but after the 23rd O'Reilly recommendation it starts to get a little unnecessary - I mean we get the idea, go and buy the entire O'Reilly back catalogue! That said, if you buy this book then it is almost obligatory to get a couple of others about MySQL and Perl, if for no other reason than to have something to reference when you get stuck (which I guarantee you will).
However, these frustrations aside, I have to commend the authors on a valiant attempt in penning this book. In actuality Baseball Hacks is a very valuable resource that, if properly used, can alleviate the pain of complex data gathering and tricky analysis. Simply put, there is no other book like it, and that makes it a required text for all sabermetricians, aspiring or otherwise. If you are serious about baseball stats you've got no choice but to buy it, but I urge you to heed an immortal line from Charles Dickens: it was the best of times, it was the worst of times. That is how you will feel after you have plowed through this book.
As a postscript:
After a couple of agonizing days I now actually have a fully working five-year PBP database in MySQL! Given the time and effort it took to get working, I figured that I might as well put it to good use. What I want to do is run a series of analyses on this PBP database, with the intention of publishing my findings on BtB. To make it more interesting I want to get reader input, so every couple of weeks I'll pick the most interesting suggestion and run the analysis. Email any ideas to me. In the meantime I'll have to spend my spare time furiously learning SQL properly! The first post will run in about a month or so, as I have a few other small projects that will soon be ready for publishing.