Beyond the Box Score: An SB Nation Community

Navigation: Jump to content areas:


Sports blogs for fans, by fans.
Around SBN: USA Today / SB Nation Consensus MMA Rankings for October

Sky's SQL Tutorial, Part One (Which is Like a David Ortiz Bunting Tutorial)

When I challenged folks to write a complex SQL query using the Baseball Databank database in the last installment of Saberizing a Mac, I anticipated we'd all start at ground level and work our way up to more complex queries until we eventually figured it out.  Well, sometimes with a lofty goal, taking all those baby steps can be overwhelming and we just see one large, overwhelming step.  I decided to start from the beginning, with the help of this SQL tutorial and write a bunch of simple SQL queries, each adding just a bit of complexity.  Not only does starting simple help, but forcing myself to write the queries instead of copying them from somewhere else helped my brain process what was going on.  All the mistakes along the way were productive.

I'm going to paraphrase the tutorial I used and show the SQL queries I came up with.  My suggestion to you is to follow along, manually typing in each and every query, and modify them a bit (different tables, different stats, different qualifications, etc.) so your learning path is as different from mine as possible.

  1. SELECT, FROM, WHERE -- SELECT tells the program what data you want to use, be it names, teams, or stats.  These are field names, which you can look up in the individual tables at the top of the columns of data.  FROM tells the program which table(s) you're pulling data from.  WHERE tells the program what conditions you're setting -- you'll almost always want to pull only certain data, not all of it.

    My first query looked like this:
    SELECT playerID, yearID, HR
        FROM Batting
        WHERE HR >= 50;
    It pulls the player id code (not name, that'll come next time with JOINs), the year, and number of home runs.  The data is in the Batting table.  And I only wanted to pull instances (rows) where the home runs were at least 50.  (This is not all the 50 HR seasons, though, as I'll explain later.)

Star-divide

  1. Some details on WHERE -- When you want to set conditions based on numbers, the WHERE statements are easy.  With text, you have some more options.  If you want to choose a specific text string, you need to surround it in quotes and you need to put AND or OR between multiple conditions.  Like this:
    SELECT playerID, yearID, HR
        FROM Batting
        WHERE HR >= 50 AND (teamID = "SEA" OR teamID = "NYA");
    You should see a bunch of Griffey- and Ruth-related results.  You can also pick apart pieces of text strings using LIKE.  (I believe the geeks call these "regular expressions".)  Let's say you only want to use players whose last names start with "R".  Then you'd do this:
    SELECT playerID, yearID, HR
        FROM Batting
        WHERE HR > 40 AND playerID LIKE 'R%';
    The % sign means there could be any number of characters after the R.  If you wanted to indicate just one wildcard character, use _ (an underscore).  You can also use NOT LIKE to take the complement set, use multiple combinations of letters/%/ _, and include numbers.  For more tips, try this.

  2. ORDER BY -- Ok, nice you you have your results, but what if you want to sort them?  Use ORDER BY followed by a field name.  This makes the most sense if you order by whatever data you're pulling but I don't think it's necessary.  For example, you could sort home runs hitters by the number of doubles they hit:
    SELECT playerID, yearID, HR
        FROM Batting
        WHERE HR > 40 AND playerID LIKE 'R%'
        ORDER BY 2B Desc;
  3. Doing calculations -- When you SELECT data, some stats you want already exist, like home runs, while some don't, like OBP.  In those cases, you have to tell the program to do the calculation and provide a name.  (You can also provide a new name to a stat without doing a calculation, like renaming 2B as "Doubles".)  Something like this, although this isn't a great definition of OBP:
    SELECT playerID, yearID, HR, (H+BB+HBP)/(AB+BB+HBP) AS "OBP"
        FROM Batting
        WHERE HR > 40 AND playerID LIKE 'R%'
        ORDER BY 2B Desc;
    There are also some built-in calculations you can use, like MIN, MAX, SUM, AVG, and more.

  4. GROUP BY -- The data in the data tables is extremely granular, probably more granular than we want most of the time.  For example, in the Batting table, the same hitter in the same season could have multiple entries, even for the same team.  All "stints" (consecutive games played for the same team) are separated.  So if a player was a Giant, then an Oriole, then a Giant again, he'd have three separate lines.  Two lines would have stint listed as 1, while the second stint for the Giants would have stint listed as 2.  If you want all of a player's season data aggregated, or all of his data for the same team aggregated, you need to tell the program to do just that.  Hence, GROUP BY.  If you only group by playerID, though, it will combine all seasons.  So you need to list all the ways you want the groupings to be unique.  I'm using player and season below, but you might also want to separate performance by team:
    SELECT playerID, yearID, HR
        FROM Batting
        WHERE HR > 40 AND playerID LIKE 'R%'
        GROUP BY playerID, yearID
        ORDER BY HR Desc;

Ok, that's as far as I've gotten for now.  I just took a peak at Colin's two SQL tutorials and he has similar stuff, plus more, so go check those out.  I'll definitley continue with part two soon.

3 recs  |  Comment 30 comments |

Story-email Email Printer Print

Around SB Nation

New FanShots feature

Mar 2008 from Pinstripe Alley - 1 comment

Comments

Display:

For those of you not well versed in SQL

SQL for Dummies is a great book,,,

Here is a pdf of an Access Turtorial

This is a link to the SQL reserved words

Hope these help…

JD’s like, "you want some fucking pitching? Here’s all the pitching you can stand. Now choke on it, bitches!"- RCCook

by laxtonto on Jun 22, 2009 3:57 PM EDT reply actions   0 recs

This is worthless. Where the hell are your notes about stored procedures? Preventing SQL Injections? Many-to-Many relationships?

VIEWS? YOU DIDN’T COVER VIEWS?

Also, good work.

"Where we all wait in earnest with pudding in hand for the Upton comet to sail through the roofed skies, so that we may meet Him."

by kericr on Jun 22, 2009 4:57 PM EDT reply actions   0 recs

To anybody who actually feels compelled to ask a question about what I'm talking about, don't bother. It actually has absolutely nothing to do with what Sky is trying to show you.

"Where we all wait in earnest with pudding in hand for the Upton comet to sail through the roofed skies, so that we may meet Him."

by kericr on Jun 22, 2009 4:58 PM EDT up reply actions   0 recs

So, with a little look ahead to JOINs over at Colin's tutorials, I have solved my challenge.

Weighted team age using Win Shares, 2008. Anyone want to verify my result? I won’t give it away quite yet. Maybe I’ll walk through it in part two of the really really basic tutorial.

teamID yearID Result
NYA 2008 32.81753466
HOU 2008 31.82842758
PHI 2008 30.88740043
BAL 2008 30.2372549
CHA 2008 30.1917603
TOR 2008 30.1346527
CHN 2008 30.09446925
DET 2008 30.0859586
SDN 2008 29.95824524
NYN 2008 29.89625468
BOS 2008 29.85513855
SEA 2008 29.83378896
SLN 2008 29.64277412
SFN 2008 29.40037071
MIL 2008 29.32814238
CIN 2008 29.05177848
PIT 2008 28.96368159
LAA 2008 28.91042291
LAN 2008 28.75852498
TEX 2008 28.50696496
CLE 2008 28.46207749
ARI 2008 28.2895486
COL 2008 28.18198198
ATL 2008 27.918443
FLO 2008 27.91706349
WAS 2008 27.90847458
KCA 2008 27.86438417
TBA 2008 27.42499142
OAK 2008 27.38211744
MIN 2008 26.9341408

Beyond the Boxscore // Calling BJ Upton lazy is lazy.

by Sky Kalkman on Jun 22, 2009 5:41 PM EDT reply actions   0 recs

Youngest team since 1950, weighting by production as measured by WS: 1967 Kansas City A's at 24.4 years

Oldest was the 2006 Giants at 33.8 years old. Second and third place belong to the 2005 and 2004 Yankees.

Beyond the Boxscore // Calling BJ Upton lazy is lazy.

by Sky Kalkman on Jun 22, 2009 5:48 PM EDT up reply actions   0 recs

I will add...

that I simply defined age as Year minus Birth Year, no July 1st hocus pocus or partial year calculation (which is probably the most exact way to do it.)

Beyond the Boxscore // Calling BJ Upton lazy is lazy.

by Sky Kalkman on Jun 22, 2009 6:06 PM EDT up reply actions   0 recs

A tip on weighted averages...

You multiply each thing by it’s weight, add it all up, then divide by the sum of all the weights.

For example, a 5/4/3 hitting Marcel for OBP would look like this:

2010_OBP = (2007_OBP x 3 + 2008_OBP x 4 + 2009_OBP x 5) / (3 + 4 + 5)

Beyond the Boxscore // Calling BJ Upton lazy is lazy.

by Sky Kalkman on Jun 22, 2009 7:56 PM EDT up reply actions   0 recs

Some tips I learned today

Whatever you’re doing, restrict your year to one to start. It’ll speed up the query. Also, if you’re joining, declare the join immediately, so you don’t end up doing a cartesian join. And if using Sequel Pro, save your queries in a text file so if you screw up and it crashes from the heavy work load, youre query is saved.

Beyond the Boxscore // Calling BJ Upton lazy is lazy.

by Sky Kalkman on Jun 22, 2009 5:43 PM EDT reply actions   0 recs

lol --- that one is learned the hard way too often

Jeff Zimmerman - Protecting the world from RBI's and Wins from my mom's guest house.

by Jeff Zimmerman (TucsonRoyal) on Jun 22, 2009 5:49 PM EDT up reply actions   0 recs

Thank you for continuing on this stuff

Did you (or anybody) ever find how to do: “For all teams to make the playoffs since, oh, 1980, where did they rank within their league in total homeruns?”

I’m still working on it, when I have time.

by RFK on Jun 22, 2009 6:27 PM EDT reply actions   0 recs

Good question.

I’m a webdev by day so I have a decent command of SQL, but I’m by no means an expert. If you have a “how would I do this?” moment and you can reduce your problem to the simplest terms (as in “how do I extract the rankings of set of records where all I have is a column containing the sum of some property”… in this case HR) this is a really great source. I had to use it for this one (took me about an hour to figure it out):

http://www.artfulsoftware.com/infotree/queries.php?&bw=1243

Here is your solution.

select battingpost.teamid, battingpost.yearid, battingpost.lgid, rank.rank from battingpost
join (
  select b1.teamid, b1.yearid, count(b2.total_hr) as rank
  from (select teamid, yearid, sum(batting.hr) as total_hr, lgid from batting group by teamid, yearid) as b1
  join (select teamid, yearid, sum(batting.hr) as total_hr, lgid from batting group by teamid, yearid) as b2
    on (
      (b1.total_hr < b2.total_hr and b1.yearid = b2.yearid and b1.lgid = b2.lgid) OR
      (b1.total_hr = b2.total_hr and b1.yearid = b2.yearid and b1.teamid = b2.teamid and b1.lgid = b2.lgid)
  )
  where b1.yearid >= 1980
  and b2.yearid >= 1980
  group by b1.teamid, b1.yearid
) as rank on (rank.teamid = battingpost.teamid and rank.yearid = battingpost.yearid)
where battingpost.yearid >= 1980
group by battingpost.teamid, battingpost.yearid
order by yearid asc, rank asc

There might be an error in the deepest table subquery, but it looks like the data coming back is accurate. FWIW, if I were computing this in application I were working on, I’d actually do it in multiple steps for the sake of clarity.

kling klang king of the rim ram room

by chief of the fuzzers on Jun 22, 2009 10:58 PM EDT up reply actions   0 recs

Wow

I didn’t even have the “column containing the sum of some property” part figured out properly yet. (And it looks like it all checks out okay to me, too.) Thanks, now I can start obsessing about something else…

by RFK on Jun 22, 2009 11:40 PM EDT up reply actions   0 recs

Trying to break this down in English...

In the middle you’re finding the HR totals for each team-season (twice). By joining that sub-query to itself in some creative way, you’re then finding where each team-season ranked within their own season and league (since 1980). Then you’re joining those results with the postseason info so you’re only looking at teams who made the post-season.

I get most of that, I just don’t get what’s going on in the “creative join” where you actually find the rank. A little help would be appreciated?

Beyond the Boxscore // Calling BJ Upton lazy is lazy.

by Sky Kalkman on Jun 23, 2009 9:04 AM EDT up reply actions   0 recs

Sure

I have some examples here to illustrate what’s going on, but SBN doesn’t format monospace very well so I’m linking to them.

First off, I’ll simplify the schema a bit.

Say you have a table that looks like this (which is similar to what we have in the innermost set of subqueries):

base set

If you join that against another identical set based on total_hr equality OR the condition “total_hr is less than the join table’s total_hr” you get a set that looks like this:

joined set

So if you were insert DET into the original table with an HR count of 185, BAL would then be joined 3 times, one for itself, one for TEX, one for DET.

You then count the occurrences of BAL, and that’s its rank, because it joined against the number of other teams with a higher HR count, plus itself, then group by b1.teamid to return an abbreviated set.

kling klang king of the rim ram room

by chief of the fuzzers on Jun 23, 2009 12:49 PM EDT up reply actions   0 recs

Awesome thanks. I get it now.

My translation, if it helps anyone else:

For each team, a row is created via the join for every team they’re tied with in home runs or have fewer home runs than, including themselves. The number of rows created for each team is their rank. The best team will only appear once, because they tie themselves. The worst team of 15 will appear 15 times, because they have fewer home runs than 14 teams and tie themselves. Of course, we’re only counting when a team appears in the left most column, not the third column.

If you want a team’s rank where lower is better, switch “less than” to “greater than”. Now, this doesn’t handle ties like golf prize money. How would you do that? Actually, I’ve seen the technique done in Excel before. I’ll look for it.

Beyond the Boxscore // Calling BJ Upton lazy is lazy.

by Sky Kalkman on Jun 23, 2009 10:33 PM EDT up reply actions   0 recs

Cool Sky

I’ve already played around with mine already, and I can do basic things. One thing I’m having trouble with is doing separate calculations, like finding the average batting average by decade.

St. Louis relievers... defying win expectancy since 2008
http://www.drivelinemechanics.com/

by vivaelpujols on Jun 22, 2009 9:03 PM EDT reply actions   0 recs

I'd try some modulo and/or floor functions to define "decade" in the SELECT statement...

then group AB and H by decade. of course, then you have to do the division in Excel. There must be a way to code it all…

SQL math functions:
http://sqlcourse2.com/math.html

Beyond the Boxscore // Calling BJ Upton lazy is lazy.

by Sky Kalkman on Jun 22, 2009 9:19 PM EDT up reply actions   0 recs

You can do it in a single Select

select concat(left(yearid,3),“0s”) as decade, round(sum(h)/sum(ab),3) as batting_average from lahman.batting group by left(yearid,3)

Although I cheated by not using math to identify the decades

by Dan Turkenkopf on Jun 22, 2009 9:37 PM EDT up reply actions   0 recs

Cheater.

Although, I like the cheat method to defining the decade.

In your GROUP BY statment, couldn’t you just say “decade” instead of doing left(yeaid,3)?

Of course, if we’re being sticklers for the definition of decade, you’d need math, or do what you did after adding 1 to each year. (The last decade in the previous millenium was 1991 through 2000).

Beyond the Boxscore // Calling BJ Upton lazy is lazy.

by Sky Kalkman on Jun 22, 2009 9:48 PM EDT up reply actions   0 recs

Yeah I know that's the real way for decades, but no one actually treats them like that

And yes, you could group by decade in that example.

What you can’t do is use ‘decade’ in a where clause. You can’t use an aliased (defined by the ‘as’) field in a where clause. You either need to use the calculation or add a ‘having’ clause to the end.

So this would fail:
select concat(left(yearid,3),“0s”) as decade, round(sum(h)/sum(ab),3) as batting_average from lahman.batting where decade = ‘2000s’ group by decade

but this would succeed:

select concat(left(yearid,3),“0s”) as decade, round(sum(h)/sum(ab),3) as batting_average from lahman.batting where concat(left(yearid,3),“0s”) = ‘2000s’ group by decade

as would this:

select concat(left(yearid,3),“0s”) as decade, round(sum(h)/sum(ab),3) as batting_average from lahman.batting group by decade having decade = ‘2000s’

by Dan Turkenkopf on Jun 22, 2009 9:53 PM EDT up reply actions   0 recs

And you guys will be really dangerous once you start nesting select statements

For example:

select namelast, namefirst, sum(a.h) / sum(a.ab) as batting_average from lahman.master m join
(select b.playerid, b.yearid, sum(b.h) as h, sum(b.ab) as ab
  from lahman.batting b
  join lahman.fielding f
    on b.playerid = f.playerid and b.yearid = f.yearid and f.pos = ‘SS’
    group by b.playerid) a
on m.playerid = a.playerid
group by a.playerid

This figures out the career batting average for players in years they played shortstop. Not a very useful example, but it should illustrate the possibilities.

Of course you may already know this and then you can ignore me.

by Dan Turkenkopf on Jun 22, 2009 9:48 PM EDT reply actions   0 recs

I was actually just reading about these in my geek book, so thanks for a relevant example

21 days from now I’ll be a SQL expert, evidently…

Beyond the Boxscore // Calling BJ Upton lazy is lazy.

by Sky Kalkman on Jun 22, 2009 9:51 PM EDT up reply actions   0 recs

Ahhh, so the outpout of that nested SELECT statement is a "table" of data...

… with the columns of whatever you’re SELECTing (in this case playerID, yearID, total hits, and total ab) and the rows defined by the ON statement as part of the join. And then that “table” (which uses both unchanged data like player and year and aggregated/calculated data like total h and total ab) is joined with another table, or the same table in a different way. This is how you tackle multi-step stuff, huh?

And the results of the nested SELECT are put into a temporary “table variable”, a in this case, just like you’d label a regular table in a query. Cool.

Beyond the Boxscore // Calling BJ Upton lazy is lazy.

by Sky Kalkman on Jun 23, 2009 8:27 AM EDT up reply actions   0 recs

Actually Sky

David Ortiz has a career .571 batting average on bunts… so you must be really good at this SQL stuff.

St. Louis relievers... defying win expectancy since 2008
http://www.drivelinemechanics.com/

by vivaelpujols on Jun 23, 2009 8:12 AM EDT reply actions   0 recs

Nice work & discussion

I’ve been following Colin’s lead and using implicit joins instead of explicit joins, so I’ll have to revisit those to completely parse everything here. But there are a ton of little things here and there that I picked up here, even if i was already marginally competent at doing the basic things people have asked to do.
-j

by JinAZ on Jun 23, 2009 10:13 PM EDT reply actions   0 recs

Question:

I want to modify Colin’s Marcels code to spit out “projections” for all seasons, not just 2009. For example, given Barry Bonds’ 1998, 1999, and 2000 seasons, what would have been his projection from 2001.

Colin hard-coded the years 2009, 2008, 2007, 2006, and 2005 into the code. Can I replace them with variables somehow, letting X equal 2009 for pass one (and identifying 2008 as year X-1, etc.) and then cycle through a bunch of previous seasons letting X=2008, 2007, etc., all the way back until, say 1980 or 1950? I.e. can I use a while loop? Or do I need to approach this differently and re-write the SQL query significantly?

Beyond the Boxscore // Calling BJ Upton lazy is lazy.

by Sky Kalkman on Jun 24, 2009 8:57 AM EDT reply actions   0 recs

Comments For This Post Are Closed


User Tools

We use numbers and stuff.
Community Guidelines
Why be a member?
Start posting on Beyond the Box Score »

Join SB Nation and dive into communities focused on all your favorite teams.

FanPosts

Community blog posts and discussion.

Recent FanPosts

Small
PZR-based Win Values 2001-2006
Small
The "30 parks on a budget" challenge
Sunflower_small
World Series Simulation, Game #6
Small
JT20 Dynasty League
E52205a2_small
New Look
Sth70021_small
Exploring Hit f/x, Albeit Badly
Redcap_small
Ricky Nolasco: 4 WAR or 1 WAR?
Redcap_small
Apparently I can't do park adjustments
Small
Which tells us more: The last 7 at bats or 7 at bats against this pitcher?
Sleepy_jeff_small
How Efficient and Effective Were the Rockies in 2009?

+ New FanPost All FanPosts >

FanShots

Quick hits of video, photos, quotes, chats, links and lists that you find around the web.

Recent FanShots

The Mistake Lottery
On the Field, the Yankees Are The Team of the Decade. Off It? The Red Sox.
Tigers' all-time WAR leaders
Primer on Runs Created
How to improve basketball
LB Keith Bulluck uses a sabermetric analogy to explain the Titans' quarterback situation.
Alcides Escobar "abandoned his daughter before she was born"
UZR, Scouting, and the Fans
Not-So-Lousy Lineup Optimizer, Playoff Edition: New York Yankees

+ New FanShot All FanShots >

BtB on Twitter

Main Feed: @BtBScore

Tommy B: @tommy_bennett
Sky: @BtB_Sky
Dan: @dturkenk
Harry: @harrypav
Jinaz: @jinazreds
Jack: @jh_moore
Erik: @Erik_Manning
Tommy R: @trancel
Justin: @justinbopp

Subscribe to BtB via Email

Enter your email address:

Delivered by FeedBurner

Most Commented

Limes_125_small
Time To Move On
770insig_small
Negative Team WAR - 2009 Edition
Nando_small
A Complete and Lenghty List of Baseball-Related Things Miguel Olivo is Good At
Aviles_small
Minnesota: Fielding TargetView Before & After JJ Hardy
E52205a2_small
New Look

Managers

Nando_small R.J. Anderson

Limes_125_small Sky Kalkman

E52205a2_small Tommy Bennett

Editors

Face_small Harry Pavlidis

Rawlings_baseball_bigger_small Dan Turkenkopf

770insig_small Jeff Zimmerman (TucsonRoyal)

Aviles_small Justin Bopp

Authors

Banny_small erik

Raysring1_small Tommy Rancel

Jinaz-reds-avatar_small JinAZ

Jmlogo_small Jack Moore

1753738656_110919ebe9_o_small vivaelpujols

1_small Graham

Baseball_small Mike Rogers

Redcap_small SFiercex4

Small Patrick Clark

Walter_album_small Walter Fulbright