/cdn.vox-cdn.com/uploads/chorus_image/image/7216353/20120505_jla_aq4_167.0.jpg)
I had a few requests this weekend for assistance creating year-to-year correlation tests, so I felt this may be a helpful topic to cover for anybody looking to sharpen their SQL skills. This is surprisingly much more simple than it may seem, so I urge you not to be intimidated. Today we'll cover the two stages to this process:
- Generating year1/year 2 data from SQL.
- Test for correlation between two sets of data in excel (or excel-type programs).
Year 1/Year 2 with INNER JOINS
In order to gather data from two different but consecutive seasons, we begin by essentially create two separate queries for each year. So, if we are trying to pull information on a pitchers' ERAs in one year and the subsequent year since 2004, for instance, we then need to draw up two nearly identical queries. Label them 'a' and 'b' by placing each query inside parenthesis with the letter 'a' and 'b' immediately after the closing parenthesis.
(SELECT p.playerID, p.yearID
, ROUND(p.ipouts/3,1) as IP
, ROUND(p.er/(p.ipouts/3)*9,2) as ERA
FROM pitching p
WHERE p.yearID >= 2004
GROUP BY playerID, yearID
HAVING IP >= 162)a
(SELECT p.playerID, p.yearID
, ROUND(p.ipouts/3,1) as IP
, ROUND(p.er/(p.ipouts/3)*9,2) as ERA
FROM pitching p
WHERE p.yearID >= 2004
GROUP BY playerID, yearID
HAVING IP >= 162)b
Then we need to join the two subqueries by placing an INNER JOIN command between them. In order to do this we must also let SQL know how the two are related to one another. So we must also create an 'ON' clause in which we define that the playerIDs in query 'a' are the same as those in query 'b' and that the year in subquery 'b' is one year immediately after that in subquery 'a':
(SELECT p.playerID, p.yearID
, ROUND(p.ipouts/3,1) as IP
, ROUND(p.er/(p.ipouts/3)*9,2) as ERA
FROM pitching p
WHERE p.yearID >= 2004
GROUP BY playerID, yearID
HAVING IP >= 162)a
INNER JOIN
(SELECT p.playerID, p.yearID
, ROUND(p.ipouts/3,1) as IP
, ROUND(p.er/(p.ipouts/3)*9,2) as ERA
FROM pitching p
WHERE p.yearID >= 2004
GROUP BY playerID, yearID
HAVING IP >= 162)bON a.playerID = b.playerID
AND a.yearID = b.yearID - 1;
Once we've defined the relationship between them, we can then SELECT the data we want FROM both subqueries at the same time:
SELECT a.playerID
, a.yearID, a.IP, a.ERA
, b.yearID, b.IP, b.ERA
FROM
(SELECT p.playerID, p.yearID
, ROUND(p.ipouts/3,1) as IP
, ROUND(p.er/(p.ipouts/3)*9,2) as ERA
FROM pitching p
WHERE p.yearID >= 2004
GROUP BY playerID, yearID
HAVING IP >= 162)a
INNER JOIN
(SELECT p.playerID, p.yearID
, ROUND(p.ipouts/3,1) as IP
, ROUND(p.er/(p.ipouts/3)*9,2) as ERA
FROM pitching p
WHERE p.yearID >= 2004
GROUP BY playerID, yearID
HAVING IP >= 162)bON a.playerID = b.playerID
AND a.yearID = b.yearID - 1;
Run this query and you should return 399 pairs of pitching-seasons with at least 162 innings in both years spanning from 2004-2012.
Calculating 'r'
Now that we have our data, we must transfer it to our excel-type program to find the correlation between ERA in year 1 (subquery a) and year 2 (subquery b). always have to remind myself to also copy the column names along with the rest of the data which in SQLpro is achieved via "File"/"Copy with Column names".
Paste into your excel sheet. (I personally use the 'Numbers' program, but OpenOffice is free.) Find an empty cell you want to run your correlation on your year 1 ERA from column 'D' and the year 2 ERA from column 'G'. The formula should look like this then (at least, according to my 'Numbers' program. It may different with other applications):
=CORREL(d2:d400,g2:g400)
You should get r = .38 which would perfectly match Bill Petti's findings on the matter in his oft-cited article on which pitching metrics correlate year to year from a year ago.
So that's it. Nothing to it. Now go off and try to develop the next great ERA estimator and become famous. (Special bonus points for the first person to beat out Glenn DuPaul's pFIP.)
As always, don't be afraid to ask the most basic of questions and someone from the community or myself will do the best to help you.
More from the Saberizing a Mac series:
Thanks again to the Sean Lahman for providing us with the means to do all of this.
Follow @JDGentile on twitter.