How To Build And Use A Baseball Database
Colin Wyers has a fantastic post up over at Statistically Speaking about setting up and using a relational database for doing baseball analysis. If you're like me and still stuck using Excel, you'll want to check this out. And it's a Part One, meaning there's more to come. (Unless Colin is a Mel Brooks fan.)
about 3 years ago
Sky Kalkman
21 comments
0 recs |
Comments
Very interesting
I am a huge fan of MySQL, especially for large databases. The learning curve is large and steep though.
I would think that people who are at the point where they’re struggling with Excel would start using Access. You can do queries, it’s simpler to use, and you can do all sorts of math in it.
Stupid question: where do you get all of this data from?
Curt never met a buttered roll he didn't like.
Access does a lot of nice things for you.
I’m not as comfrortable with it, though, and it costs quite a bit of money. (It also doesn’t handle large databases as well, and that’s an issue with Retrosheet data – I think that my events table is up to 8 GB, partitioned by year.) You can use Access as a frontend for MySQL, actually.
The problem with Excel isn’t that it’s hard, per se, but that it’s not meant to do a lot of the stuff that sabermetricians do.
As for a data source, start with the Baseball Databank.
I keep forgetting there are people who didn't get access with excel
And thanks for the link
Curt never met a buttered roll he didn't like.
like me
OMG Banny. FWIW I am only crdtng u w/3 runs allwd bc of DDJ OMFG
by Matt Klaassen on Nov 14, 2008 9:02 PM EST up reply actions
Also, there's no Access for the Mac
Anyone have thoughts on the OpenOffice database program?
And, regarding Access as a front-end for MySQL, how difficult is that to set up? Does it basically just remove the file-size limitations that Access has? Do you lose any Access functionality? Do you still have to deal with MySQL at all?
Beyond the Boxscore // Calling BJ Upton lazy is lazy.
by Sky Kalkman on Nov 15, 2008 12:20 PM EST up reply actions
I do 99% of my stuff in Excel.
But I think I’m going to try this. It seems easier, once you get it down pat.
Is MySQL really that tough?
Has anyone worked with the Open Office database? I haven’t, but it it’s like Access, it might be a good option.
OMG Banny. FWIW I am only crdtng u w/3 runs allwd bc of DDJ OMFG
for free, that is
OMG Banny. FWIW I am only crdtng u w/3 runs allwd bc of DDJ OMFG
by Matt Klaassen on Nov 14, 2008 9:03 PM EST up reply actions
OOo Base is, shall we say, poorly documented.
The screenshots look similar to Access, but the documentation is very sparse.
As for the question of how tough MySQL is – download it, go through the tutorial, and let us know! I’d like to think I’ve written something to at least get you started.
I did it last night!
I didn’t do the “extra credit,” but I think I did OK. It’s not really running in the background is it?
I like that it seems to be smaller and more efficient that I assume Access would be.
I’m eagerly awaiting Part 2, so that I can start completely move out of Excel for all my wasting-time-pretending-to-be-a-sabermetrician needs.
Maybe there can be a tutorial on how to take a table generated in Google Spreadsheets, MySQL, Excel, etc., and formatting it decently for SB Nation posts…. see my lame attempt (after way too much time spent on it) here.
OMG Banny. FWIW I am only crdtng u w/3 runs allwd bc of DDJ OMFG
by Matt Klaassen on Nov 15, 2008 11:41 AM EST up reply actions
oh, and thanks a bunch, this is a huge help
can’t wait for your do-it-yourself-Marcels
OMG Banny. FWIW I am only crdtng u w/3 runs allwd bc of DDJ OMFG
by Matt Klaassen on Nov 15, 2008 11:41 AM EST up reply actions
for extra credit...
write your own relational database software in Python!
(That’s what I did, not that I’m proud of it, but I started programming long before I knew anything, so I wrote a lot of Python code that clumsily does things that other programs/languages/etc do much better. But hey, I know my own RDBMS really, really well!)
Also, cheese.
question from a noob
Colin: just tried to do the “extra credit” and keep running into errors. Here’s what I have:
1 select player id,
2 sum (ipouts)/3 as inn,
3 sum (er)/(inn*9) as era
4 from pitching
5 group by playerID
6 having ipouts >= 3000
7 order by era asc
8 limit 50;
I keep getting
Error Code : 1054
Unknown column ‘player’ in ‘field list’
OMG Banny. FWIW I am only crdtng u w/3 runs allwd bc of DDJ OMFG
Make playerid one word in line 1 maybe?
by R.J. Anderson on Nov 15, 2008 4:08 PM EST up reply actions
Right.
I would also remove the spaces between the sum commands and the parenthesis.
Also, you’re can’t use an alias as a data column. So the final result looks like this:
select playerid,
sum(ipouts/3) as inn,
sum(er)/(sum(ipouts/3))*9 as era
from pitching
group by playerID
having inn >= 1000
order by era asc
limit 50;
Good catch on using Asc in the order by clause instead of Desc.
and thanks for trying to compliment the one thing I did right
heh… knowing when and when not to use commas is really a challenge for me… I got it, though. Thanks.
can’t wait to figure out how to generate wOBAs and bRAAs for the entire history of baseball (although I know the weights wouldn’t be right for each era)… then add in positional adjustments and replacement value.
That might be a while.
Is this what you use to generate projections, Colin?
OMG Banny. FWIW I am only crdtng u w/3 runs allwd bc of DDJ OMFG
by Matt Klaassen on Nov 15, 2008 4:46 PM EST up reply actions
This is exactly what I use.
I’m working on Part II of the tutorial right now, and I’m working on cleaning up that code for release, and hopefully that’ll go up either tomorrow or on Friday.
Formulas aren't tricky
Multiply each stat by the appropriate constant:
Instead of just H+K+BB or whatever, do (Hx3) + (Kx-5) + (BBx7)
And replace the x’s with *’s. Damn SBN auto-format.
Beyond the Boxscore // Calling BJ Upton lazy is lazy.

















