clock menu more-arrow no yes mobile

Filed under:

Sky's SQL Tutorial, Part One (Which is Like a David Ortiz Bunting Tutorial)

When I challenged folks to write a complex SQL query using the Baseball Databank database in the last installment of Saberizing a Mac, I anticipated we'd all start at ground level and work our way up to more complex queries until we eventually figured it out.  Well, sometimes with a lofty goal, taking all those baby steps can be overwhelming and we just see one large, overwhelming step.  I decided to start from the beginning, with the help of this SQL tutorial and write a bunch of simple SQL queries, each adding just a bit of complexity.  Not only does starting simple help, but forcing myself to write the queries instead of copying them from somewhere else helped my brain process what was going on.  All the mistakes along the way were productive.

I'm going to paraphrase the tutorial I used and show the SQL queries I came up with.  My suggestion to you is to follow along, manually typing in each and every query, and modify them a bit (different tables, different stats, different qualifications, etc.) so your learning path is as different from mine as possible.

  1. SELECT, FROM, WHERE -- SELECT tells the program what data you want to use, be it names, teams, or stats.  These are field names, which you can look up in the individual tables at the top of the columns of data.  FROM tells the program which table(s) you're pulling data from.  WHERE tells the program what conditions you're setting -- you'll almost always want to pull only certain data, not all of it.

    My first query looked like this:
    SELECT playerID, yearID, HR
        FROM Batting
        WHERE HR >= 50;
    It pulls the player id code (not name, that'll come next time with JOINs), the year, and number of home runs.  The data is in the Batting table.  And I only wanted to pull instances (rows) where the home runs were at least 50.  (This is not all the 50 HR seasons, though, as I'll explain later.)
  1. Some details on WHERE -- When you want to set conditions based on numbers, the WHERE statements are easy.  With text, you have some more options.  If you want to choose a specific text string, you need to surround it in quotes and you need to put AND or OR between multiple conditions.  Like this:
    SELECT playerID, yearID, HR
        FROM Batting
        WHERE HR >= 50 AND (teamID = "SEA" OR teamID = "NYA");
    You should see a bunch of Griffey- and Ruth-related results.  You can also pick apart pieces of text strings using LIKE.  (I believe the geeks call these "regular expressions".)  Let's say you only want to use players whose last names start with "R".  Then you'd do this:
    SELECT playerID, yearID, HR
        FROM Batting
        WHERE HR > 40 AND playerID LIKE 'R%';
    The % sign means there could be any number of characters after the R.  If you wanted to indicate just one wildcard character, use _ (an underscore).  You can also use NOT LIKE to take the complement set, use multiple combinations of letters/%/ _, and include numbers.  For more tips, try this.

  2. ORDER BY -- Ok, nice you you have your results, but what if you want to sort them?  Use ORDER BY followed by a field name.  This makes the most sense if you order by whatever data you're pulling but I don't think it's necessary.  For example, you could sort home runs hitters by the number of doubles they hit:
    SELECT playerID, yearID, HR
        FROM Batting
        WHERE HR > 40 AND playerID LIKE 'R%'
        ORDER BY 2B Desc;
  3. Doing calculations -- When you SELECT data, some stats you want already exist, like home runs, while some don't, like OBP.  In those cases, you have to tell the program to do the calculation and provide a name.  (You can also provide a new name to a stat without doing a calculation, like renaming 2B as "Doubles".)  Something like this, although this isn't a great definition of OBP:
    SELECT playerID, yearID, HR, (H+BB+HBP)/(AB+BB+HBP) AS "OBP"
        FROM Batting
        WHERE HR > 40 AND playerID LIKE 'R%'
        ORDER BY 2B Desc;
    There are also some built-in calculations you can use, like MIN, MAX, SUM, AVG, and more.

  4. GROUP BY -- The data in the data tables is extremely granular, probably more granular than we want most of the time.  For example, in the Batting table, the same hitter in the same season could have multiple entries, even for the same team.  All "stints" (consecutive games played for the same team) are separated.  So if a player was a Giant, then an Oriole, then a Giant again, he'd have three separate lines.  Two lines would have stint listed as 1, while the second stint for the Giants would have stint listed as 2.  If you want all of a player's season data aggregated, or all of his data for the same team aggregated, you need to tell the program to do just that.  Hence, GROUP BY.  If you only group by playerID, though, it will combine all seasons.  So you need to list all the ways you want the groupings to be unique.  I'm using player and season below, but you might also want to separate performance by team:
    SELECT playerID, yearID, HR
        FROM Batting
        WHERE HR > 40 AND playerID LIKE 'R%'
        GROUP BY playerID, yearID
        ORDER BY HR Desc;

Ok, that's as far as I've gotten for now.  I just took a peak at Colin's two SQL tutorials and he has similar stuff, plus more, so go check those out.  I'll definitley continue with part two soon.