Beyond the Box Score: An SB Nation Community

Navigation: Jump to content areas:


Pro Quality. Fan Perspective.
Login-facebook
New Blog: Once A Metro covering Red Bull New York!

Saberizing a Mac #3: Adding Win Shares & Other Data to the BDB

Based on the lack of comments in the previous post, I'm going to guess people had no trouble installing the Baseball Databank database.  If you did, feel free to ask questions in the comments of that post.  To follow the process I personally used to get it done, read this comment.  All Saberizing a Mac posts can be found here.

This next challenge was suggested by Jeff (TucsonRoyal).  The Baseball Databank is awesome, but what if there's additional information from other files that we want to add to the database?  How do we do that?  Well, let's figure it out, starting with Win Shares data provided by Dave Studeman.  There are two .csv files in there, both of which include BDB ids, allowing us a way to hook into the data that's already in the BDB.

Let's tackle the total WS file first, then the WS above average.  We have some choices to make.  Do we want this as a separate table in the database, or do we want to add fields to an existing table?  I'm not sure both are even an option with this WS data, but it's an interesting question to discuss.

Star-divide

As a challenge, let's see if everyone can get the data imported and then write a SQL query to find the ten oldest teams since 1980, where team age is defined as the weighted average of each player's age and their total Win Shares.  As an example, to find the 2008 World Champion Phillies' weighted age, you'd multiply Chase Utley's 2008 age by his WS, add in Ryan Howard's 2008 age times his WS, add in the same thing for everybody else, then divide by the total team Win Shares.  (I've chosent his definition because it requires us to use WS data, and weights team age by the players who are actually providing value to the team.  You could also weight by PAs or IPs to find the team age based on playing time instead of performance.)

Calculating age can be a bit tricky, and I'll refer people once again to Colin's two primers on SQL.  Colin also has some additional sample code available here.  He's offered to help folks with their own queries, which can be something else that happens in the comments of this post.  If you challenge yourself to come up with a query to answer an interesting question, but are struggling, ask for help.  If you succeed, share the code.

Once you've got the Win Shares data incorporated, try adding some of the data Tom Tango's made available here, like primary defensive position and wOBA.  And if you find any other data we can add, please share it.

Get your celebratory age-appropriate beverages chilled... set... go!

(And yes, this is something PC-users can play along with, too, although your results won't be as aestheticlly pleasing and you can't share in our yuppy attitudes.)

Poll
Are you playing along at home?
Yes!
13 votes
Yes, but on a PC.
7 votes
Not yet, but I will.
9 votes
I already have this stuff set up.
2 votes
I want to, but I'm scared.
7 votes
No, you're all dorks. (Keep in mind we know this already.)
5 votes

43 votes | Poll has closed

0 recs  |  Comment 27 comments |

Story-email Email Printer Print

Comments

Display:

My answer
Do we want this as a separate table in the database, or do we want to add fields to an existing table?

Definitely separate tables. Otherwise you have major troubles when either piece of data is updated.

If you want to get the data in one easily accessible “table” try creating a view. It doesn’t have any performance improvement over a straight SQL statement, but it saves you having to write the SQL each time.

by Dan Turkenkopf on Jun 10, 2009 9:20 PM EDT reply actions   0 recs

Sounds like good advice.

But what do you know about this sort of thing?

How is a view different from a query?

Beyond the Boxscore // Calling BJ Upton lazy is lazy.

by Sky Kalkman on Jun 10, 2009 9:32 PM EDT up reply actions   0 recs

A view is simply a query that the database remembers.

And without getting too far into the normal forms and whatnot – it’s generally more logical to introduce a new table for these sort of things.

by cwyers on Jun 11, 2009 2:50 AM EDT up reply actions   0 recs

We could get into a normalization discussion though and really confuse Sky

What’s your feeling on the tradeoffs between normalization and performance? :)

by Dan Turkenkopf on Jun 11, 2009 7:45 AM EDT up reply actions   0 recs

When it comes down to it, I prefer performance.

The Retrosheet schema I use is not very normalized at all, and it doesn’t bother me any. (I don’t even think it follows the facts table setup of a data warehouse, but I’m not an expert here.)

by cwyers on Jun 11, 2009 11:24 AM EDT up reply actions   0 recs

Me neither, really

I’ve never bothered to normalize my retrosheet tables either.. they’re basically all 100 or so fields laid out in a single table. It suits my purposes so I haven’t spent a lot of time to make it better.

by Dan Turkenkopf on Jun 11, 2009 6:01 PM EDT up reply actions   0 recs

This first step has worked for me so far

In the Pro Sequel BDB database I created a table from the Query view:

 CREATE TABLE `BDB`.`2008_Historical_WinShares` (
`playerID` VARCHAR ( 10 ) NOT NULL ,
`yearID` VARCHAR ( 6 ) NOT NULL ,
`Age` VARCHAR ( 6 ) NOT NULL ,
`Multiteam?` VARCHAR ( 6 ) NOT NULL ,
`Multi-Same Team?` VARCHAR ( 6 ) NOT NULL ,
`stint` VARCHAR ( 5 ) NOT NULL ,
`teamID` VARCHAR ( 6 ) NOT NULL ,
`lgID` VARCHAR ( 4 ) NOT NULL ,
`Pos` VARCHAR ( 5 ) NOT NULL ,
`WinShares` VARCHAR ( 10 ) NOT NULL ,
`BatWS` VARCHAR ( 10 ) NOT NULL ,
`FieldWS` VARCHAR ( 10 ) NOT NULL ,
`PitchWS` VARCHAR ( 10 ) NOT NULL ,
`TotWS` VARCHAR ( 10 ) NOT NULL
) ENGINE = MYISAM

…then I imported the 2008_Historical_WinShares.csv file (File>Import…). I just went with the default settings.

(At least the data is there; I don’t claim to know what I’m doing. Your mileage definitely may vary.)

by RFK on Jun 11, 2009 2:01 AM EDT reply actions   0 recs

Sweet.

Can you run a query using the table?

Beyond the Boxscore // Calling BJ Upton lazy is lazy.

by Sky Kalkman on Jun 11, 2009 8:46 AM EDT up reply actions   0 recs

Yes, here are the top 50 total Win Share years for the SF Giants, for example:

SELECT playerID
, yearID
, TotWS
, teamID
FROM 2008_Historical_WinShares
GROUP BY playerID
HAVING teamID = “SFN”
ORDER BY TotWS Desc
LIMIT 50;

What the heck, it’s something!

by RFK on Jun 11, 2009 11:47 AM EDT up reply actions   0 recs

Uhhh

yeah, what Colin said!

by RFK on Jun 11, 2009 5:24 PM EDT up reply actions   0 recs

If you want to know the truth

I created the table from the MAMP/phpMyAdmin/Operations/Create new table on database tool. It winds up giving you the code. Copy and paste, baby!

by RFK on Jun 11, 2009 5:32 PM EDT up reply actions   0 recs

And

I just aped Colin’s second code from the historic November 16, 2008 “Building a sabermetrician’s workbench, part II”. Colin is my Zeus!

by RFK on Jun 11, 2009 5:38 PM EDT up reply actions   0 recs

MySQL uses several different storage engines.

That just tells it which one to use. MyISAM is probably the best engine choice for the vast majority of the things that you’d want to use MySQL for (and by you I mean a sabermetrician).

by cwyers on Jun 11, 2009 4:19 PM EDT up reply actions   0 recs

Not really.

Although I can’t tell you what your install will default to (mine will make a table MyISAM by default). It’s not a bad idea to do so, however.

by cwyers on Jun 12, 2009 12:36 AM EDT up reply actions   0 recs

Uhhh

yeah, what Colin said!

by RFK on Jun 11, 2009 5:25 PM EDT up reply actions   0 recs

Challenge I'm working on...

For all teams to make the playoffs since, oh, 1980, where did they rank within their league in total homeruns?

Beyond the Boxscore // Calling BJ Upton lazy is lazy.

by Sky Kalkman on Jun 11, 2009 10:14 AM EDT reply actions   0 recs

FWIW I'm working on this, too

But it might be a few days before I have time to get it together. (At least it’s easier than the Ten Oldest Teams Since 1980 Challenge.)

by RFK on Jun 12, 2009 10:30 AM EDT up reply actions   0 recs

Rank is tricky...

…as MySQL is missing some functionality in that regard. Here’s the workaround I use.

by cwyers on Jun 12, 2009 1:38 PM EDT up reply actions   0 recs

Whoops.

That works, but this is the one I use.

by cwyers on Jun 12, 2009 1:39 PM EDT up reply actions   0 recs

I'm having trouble

SELECT yearID, teamIDwinner, teamIDloser FROM BDB.SeriesPost WHERE `yearID` >= 1980

tells me that the Royals and the Yankees made the playoffs in 1980, and

SELECT yearID, lgID, teamID, HR,
@num := @num + 1 rank from BDB.Teams,
(SELECT @num := 0)
d
WHERE `yearID` = 1980
AND lgID = ‘AL’
ORDER BY yearID ASC, lgID ASC, HR DESC

tells me that the Yankees ranked #2 and the Royals ranked #9 in homers that season in the AL. But I can’t create a view of the second statement (because it includes a variable). Since I can’t, I’m lost from there, unless I do the rest of the years manually.

by RFK on Jun 15, 2009 10:42 PM EDT up reply actions   0 recs

If anybody is still paying attention to this,

I found another way to rank, so that a view can be created.

CREATE VIEW view_name AS
SELECT yearID,lgID, teamID, HR, (
SELECT count( * ) + 1
FROM Teams b
WHERE a.HR < b.HR
AND yearID = 1980
AND lgID = ‘AL’
) AS rank
FROM Teams a
WHERE `yearID` = 1980
AND lgID = ‘AL’
ORDER BY rank ASC

This again tells me that the Yankees ranked #2 and the Royals ranked #9 in homers that season in the AL. Then, if I run this query:

SELECT a.yearID, a.teamIDwinner, b.rank as rankw, a.teamIDloser, b.rank as rankl,a.round
FROM SeriesPost a, view_name b
WHERE a.yearID = 1980
AND a.teamIDwinner = b.teamID

…I get an answer, but it’s only half-right. Also, it’s just for 1980, not every season since. So I guess I have some more work to do.

by RFK on Jun 17, 2009 2:15 PM EDT up reply actions   0 recs

Still paying attention!

Not a lot of time to play with it, though.

Beyond the Boxscore // Calling BJ Upton lazy is lazy.

by Sky Kalkman on Jun 18, 2009 2:04 PM EDT up reply actions   0 recs

Comments For This Post Are Closed


User Tools

We use numbers and stuff.
Community Guidelines
Why be a member?
Start posting on Beyond the Box Score »

Join SB Nation and dive into communities focused on all your favorite teams.

Connect_with_facebook

Cbs_fantasy_baseball_promo

FanPosts

Community blog posts and discussion.

Recommended FanPosts

Picture-2_small
Advanced Graphing Techniques Part 4 - Cluster Analysis in R

Recent FanPosts

Small
SB Nation Survey + Chance to donate $500 to a charity of this community's choice
Small
Spring Training Statistics?
Patrick_willis_small
WAR Fantasy draft
Small
Is It No Longer Sonny In Tampa?
Limes_125_small
Help With 32 Questions Contest
Otto_-_image_adjusted_small
Good reasons for getting past the divided leagues era
Small
Nathan's Replacement
Rays_small
Chasing the Grail, Part Two
Rays_small
Chasing Sabermetric’s Holy Grail or, Another Stab at Estimating Catcher Defense

+ New FanPost All FanPosts >

FanShots

Quick hits of video, photos, quotes, chats, links and lists that you find around the web.

Recent FanShots

Optimizing the Phillies and Yankees Lineups
Join BtB's Last Minute March Madness Pool
Dave Allen did this with home runs. I thought this might look cool for certain pitchers who had pinpoint command.
Irish Ballplayers -- Seems appropriate for today
Simpson's Paradox, Bert vs. Jack, and more sweet B-Ref splits

I've played fantasy baseball for many years. (My first team's rotation featured rookies Jason Bere and Aaron Sele.  Jay Buhner and Mo Vaughn anchored my lineup.) But I haven't played it well since 2003 or 2004.  My excuse?  Kids.

No, it's not that I fill my rosters with unproven youngsters.  It's that my wife and I have too many.  Too many to allow me the time necessary to have success in a competitive fantasy baseball league.

I've thought about hanging up my fantasy spikes but I really like playing it -not to win necessarily- but really just to keep an eye on good and great players from around baseball.  Players  like Adrian Gonzalez or Josh Johnson who of course come up on the each team's schedule sporadically throughout the summer but then seemingly disappear.  I like knowing about those guys and I like being able to talk to other baseball fans about those guys.  So I play fantasy baseball.  And I'm gonna continue to play.

...As long as at least two or three of you folks wanna join the rest of us in my new league: S(a.N.D)B.O.K.X. Fantasy Baseball.

The SAND is the tautologically redundant part of the acronym: Simple (and Not Difficult)

The BOKX explains just how simple (and not difficult) this league's scoring system will be: Based Only on Ks (strikeouts) and Xs (extra base hits).

* * * This league's scoring system has two only statistics: strikeouts and extra-base hits.* * *

Hopefully, this keeps things simple (and not difficult) and will thus require only a few minutes each week to seek out good hitters and good pitchers to replace not-quite-as-good pitchers and not-quite-as-good hitters.  (And hopefully, I'll have Ryan Howard on my team.)

We'll probably use only players from the National League and  unless things change for some reason, it'll be a points league. It's gonna be on Yahoo! so it'll be free and unless things change for some reason, we'll just be playing for bragging rights (and something to do).

I've had some help from Red Reporter's sabermetric higher-ups and I think I know how to weight things so that the NL's good, better and best strikeout pitchers are going to be worth roughly what the NL's good, better and best sluggers will be worth on draft day.  But I've never seen or heard of a league like this so I don't know quite what to expect and if anybody can think of any reason that this format might go beyond just being quirky and we'll like end up ripping a hole in the space/time continuum, help us out, eh?

So, if you wanna play and/or if you have any questions, let me know below.

We had the draft set for Wednesday, March 31st at 8:30pm EST.  But I'm almost certain that we're going to change that date. I think we'll need to settle on a date and a time and I think we'll need to do it sooner rather than later so as soon as we can get our ninth and tenth owners we'll restart that conversation.

I'm gonna go ahead and post an email address so that if there are any lurkers who wanna play they don't necessarily have to create an SBNation account in order to do so.  Just let me know what you're thinking: SANDBOKX.at.Gmail
Why Nathan's loss won't kill the Twins - ESPN TMI Blog
The Book -- Predicting the HR leader board
Spring Training Pitch FX Data
The All-CHONE Team

+ New FanShot All FanShots >

BtB on Twitter

Main Feed: @BtBScore

Jeff: @jeffwzimmerman
Steve: @steve_sommer
Sky: @BtB_Sky
Dan: @dturkenk
Harry: @harrypav
Jinaz: @jinazreds
Jack: @jh_moore
Erik: @Erik_Manning
Tommy R: @trancel
Justin: @justinbopp

Subscribe to BtB via Email

Enter your email address:

Delivered by FeedBurner

Most Commented

BtB Goes Social

SBNation.com Recent Stories

Chicago Cubs pitcher Angel Guzman, center, rubs the head of teammate Carlos Marmol, left, while chatting with Marmol and manager Lou Piniella, right, on the first day of baseball spring training Thursday, Feb. 18, 2010, in Mesa, Ariz. (AP Photo/Paul Connors)

SB Nation's 2010 MLB Previews: Chicago Cubs, Seeking Redemption

New York Mets starting pitchers Mike Pelfrey, left, Johan Santana, center, and Oliver Perez watch as teammates take part in drills during spring training baseball Saturday, Feb. 27, 2010, in Port St. Lucie, Fla. (AP Photo/Jeff Roberson)

SB Nation's 2010 MLB Previews: New York Mets, The High Cost Of Low Expectations

Washington Nationals pitcher Stephen Strasburg throws during the second inning of  a spring training baseball game against the Detroit Tigers Tuesday, March 9, 2010 in Viera, Fla. (AP Photo/Charlie Riedel)

Nationals Send Stephen Strasburg To Double-A Despite Impressive Spring

More from SBNation.com >


Managers

Wbc_029_small Jeff Sullivan

Editors

Rawlings_baseball_bigger_small Dan Turkenkopf

Limes_125_small Sky Kalkman

770insig_small Jeff Zimmerman (TucsonRoyal)

Aviles_small Justin Bopp

Paige_small Satchel Price

Authors

Jinaz-reds-avatar_small JinAZ

Face_small Harry Pavlidis

Newavatar_small Matt Klaassen

1753738656_110919ebe9_o_small vivaelpujols

Ozzie_small erik

Big_pun--300x300_small Tommy Rancel

Adam_small adarowski

Redcap_small SFiercex4

St_louis_cardinals_ce1141_003263_small stevesommer05

Julio_teheran_2_small PWHjort

Cclogo_small Daniel Moroz