Advanced Graphing Techniques Part 1 - Installing a Pitch FX Database on a Computer
There was discussion in another FanPost on showing people how to do some more advanced graphing techniques. I have wanted to create some heat maps myself, so I will start a series on how to do more advanced graphing and calculations.
I will step through the following 3 Parts for now and then go into other subjects.
- Setting up a pitch FX dataset and making a query.
- Installing R, linking it to the dataset and plotting some points
- Creating a heatmap using R
I will keep a general theme going and we can see how using advanced features gives us better results. We will follow how and where Joe Mauer hits home runs off right handed pitchers in 2009. Once everyone following the process is done, we will move to the next step. I am running on a PC, so there maybe different steps for those with Macs. Vivaelpujols will be following the series and I am pretty sure he has a Mac to help along.
Steps start after the jump
Steps:
- Goto: http://www.hardballtimes.com/main/article/databases-for-sabermetricians-part-one/ and complete everything until you get to the "Putting the "Data" in the Database" section.
- We will use most of the steps in the "Putting the "Data" in the Database" but instead of downloading the Databank data, download a Pitch F/X database found here. This dataset is quite large, so I would make sure you have ~10GB free to be safe in the installation process.
- Once the dataset is installed, queries can than be made. Try to get the list of pitch locations (x,y data) for HRs hit by Joe Mauer off RHP in 2009. See if you can use some of Colin's sample codes to get the data. Here is a sample query I used to get a list of HRs hit off of Randy Johnson in the strike zone:
select a.ab_id, p.px, p.pz,a.des
,a.pitcher ,a.batter
from atbats a
join pitches p on p.ab_id = a.ab_id
join games g on g.game_id = a.game_id
where a.pitcher = 116615
and g.date like '2008%' and p.px >-1.2 and p.px < 1.2
and p.pz >= 1.5 and p.pz <=3.5
and p.pitch_type ='FA'
and a.des like '%home%'
group by ab_id
I was thinking of just putting up the Mauer query, but I add data as people as it. I think it is best to work through this part individually and when you get stuck ask for help.
- Others or myself will help with the query, but each person needs to look around the Pitch FX dataset to see where the data (Pitch location, speed, batter, pitcher, umpire, etc) is located and how to get it with a query. Creating the Joe Mauer query isn't the only one that we can help people with at this time. Let us know the query you want to do and your attempt and we will help you through it. Writing queries will be useful for all datasets, so it is best to figure it out initially.
- Once everyone has a database setup and can make some basic queries, we will move on to installing R on the computer and linking it to the Pitch FX dataset.
There are no stupid questions here, except "What is Dayton Moore thinking?", so feel to ask anything.
select p.px, p.pz
from atbats a
join pitches p on p.ab_id = a.ab_id
join games g on g.game_id = a.game_id
join players pl on pl.eliasid = a.batter
where pl.last = 'Mauer' and pl.first = 'Joe'
and g.date like '2009%'
and a.event = 'Home Run'
and p.type = 'X'
39 comments
|
2 recs |
Do you like this story?
Comments
Thanks for putting this up Jeff
I am on a mac, so I can help anyone who is having trouble with the PC steps.
Error populating database
When I try to populate the database with the downloaded data, I get error message referring to incorrect SQL syntax. Help?
Did you unzip the data?
Can you post the exact syntax
- .-. ..- … – / – …. . / .—. .-. - .. . … …
by Jeff Zimmerman on Feb 12, 2010 8:00 AM EST up reply actions
Also Don't name the database, it will be automatically named.
- .-. ..- … – / – …. . / .—. .-. - .. . … …
by Jeff Zimmerman on Feb 12, 2010 8:25 AM EST up reply actions
Just right click in the SQLyog interface on root@localhost
Select restore from SQLdump and point to the unzipped .sql file
- .-. ..- … – / – …. . / .—. .-. - .. . … …
by Jeff Zimmerman on Feb 12, 2010 8:27 AM EST up reply actions
Shouldn't it be grouped by pitch id?
a.event = ‘home run’
p.type=‘x’
Also, the SQL dump was missing the first two games in the 2008 season and had some unimportant spring training games it as well. That’s when I downloaded it a month or two ago.
rzar.wordpress.com
draysbay.com
raysprospects.com
It can be grouped by pitch ID, but I am hoping to just use the x,y coordinates for the final heat map
I didn’t know of any games missing except the ones played in Japan.
I will look into getting the spring training games removed.
- .-. ..- … – / – …. . / .—. .-. - .. . … …
by Jeff Zimmerman on Feb 12, 2010 11:07 AM EST up reply actions
Here is how to get rid of the 2008 ST games:
delete from games where date like “2008-02%”
- .-. ..- … – / – …. . / .—. .-. - .. . … …
by Jeff Zimmerman on Feb 12, 2010 12:12 PM EST up reply actions
I already got rid of them
Although I only use the dump for the 2007 season data which has a lot of errors every time I parse it.
rzar.wordpress.com
draysbay.com
raysprospects.com
2007 is a mess, a lot of errors on the MLB side and I know people that don't even use it.
- .-. ..- … – / – …. . / .—. .-. - .. . … …
by Jeff Zimmerman on Feb 12, 2010 12:50 PM EST up reply actions
I don't use 07 when I have enough data for 08 and 09
If not, then I’ll use 07. As long as the errors are distributed randomly, it’s not a huge deal.
by vivaelpujols on Feb 12, 2010 2:53 PM EST up reply actions
field definitions
Thanks so much Jeff, Vivaelpujols (I am on a mac these days and you are da man!) and everyone else for making this really fun and easy! i have real limited experience with this stuff, so I appreciate the help alot. I have a pretty lame question:is there a glossary for the pitch f/x table fields anywhere? Most are quite self-explanatory, and mike fast pointed out the newer ones like pitches.sv_id, but I still get confused on pitches.×0, pitches.vx0, pitches.ax etc…thanks again for eveyone’s time and work!
Here is a decent one from Mike Fast's website:
http://fastballs.wordpress.com/2007/08/02/glossary-of-the-gameday-pitch-fields/
Let us know if there are any other questions on the categories. Maybe my next piece should be on each field if Mikes glossary is not enough
- .-. ..- … – / – …. . / .—. .-. - .. . … …
by Jeff Zimmerman on Feb 14, 2010 11:22 AM EST up reply actions
Thanks, Jeff.
Just checking in to say that I am following along here to learn / help out. I also participated in the Saberizing A Mac series, so I’ll especially try to address any Mac-specific questions.
So I'd just like to see how many people are interested in this
Can you reply to this comment if you are interested in following this series and working through the problems that may come up? If everyone who is could also specify how far along they are in this, that would be great as well.
I'm interested
I’ve been using SQL for a while with Rotosheet and Pitch F/X. I’ve trying to learn about the graphing aspect with R.
So I’m in the same place as Danger.
Follow me at http://twitter.com/JDSussman
Remember: baseball guys... baseball...
wow
retrosheet* lol
Follow me at http://twitter.com/JDSussman
Remember: baseball guys... baseball...
I'm interested.
Curious, Nick, what interface are you using with MySQL on the Mac? The one “we” used during the Saberizing a Mac series or something else?
Beyond the Boxscore Not a member? Sign up.
Yeah I'm interested
especially in what we can come up with on the R front.
by stevesommer05 on Feb 15, 2010 4:24 PM EST up reply actions
agreed
Using R or Gretl is what I am curious about
Follow me at http://twitter.com/JDSussman
Remember: baseball guys... baseball...
very interested
i am wicked psyched to follow along on my mac (using mamp&sequel pro)…i got the database going and R installed which i have no experience with, but i am psyched to learn how to use it with sql, thanks again!
For future reference, I figured this step was needed for reference.
It looks like we are ready to move on. Here is the query I used to get the Mauer data:
For ease of data, I removed the throws R on the requirement:
select p.px, p.pz
from atbats a
join pitches p on p.ab_id = a.ab_id
join games g on g.game_id = a.game_id
join players pl on pl.eliasid = a.batter
where pl.last = ‘Mauer’ and pl.first = ‘Joe’
and g.date like ‘2009%’
and a.event = ‘Home Run’
and p.type = ‘X’
- .-. ..- … – / – …. . / .—. .-. - .. . … …
by Jeff Zimmerman on Feb 15, 2010 11:14 PM EST reply actions
got an error trying this query
You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ‘%’
and a.event = ‘Home Run’
and p.type = ‘X’’ at line 7
You can't just copy and paste this into SQL, as some of the characters don't translate properly
You should just rewrite the query straight in SQL, and it should work fine.
by vivaelpujols on Feb 16, 2010 12:46 AM EST up reply actions
I get the same one when I cut and paste.
the ’ is at an angle on the cut and paste. It is not a true quote. If you replace them all, it should run fine.
- .-. ..- … – / – …. . / .—. .-. - .. . … …
by Jeff Zimmerman on Feb 16, 2010 12:47 AM EST up reply actions
I put one at the end of article that works when pasted into SQLyog
- .-. ..- … – / – …. . / .—. .-. - .. . … …
by Jeff Zimmerman on Feb 16, 2010 12:48 AM EST up reply actions
getting an error
I’m getting the following error when I try to replicate the Joe Mauer query:
Table ‘pbp2.players’ doesn’t exist
What am I doing wrong?
On the left panel, can you expand out the list and see the tables in the dataset.
Does players exist?
- .-. ..- … – / – …. . / .—. .-. - .. . … …
by Jeff Zimmerman on Feb 17, 2010 11:54 AM EST up reply actions
No
The only tables I have are atbats, game_types, games, pitch_types, and pitches.
Each time I try to download the pbp2.sql file it exits about halfway through when reading. Not sure why its doing this. I’m on a mac running MAMP and Sequel Pro.
I might have to wait for Viva to reply.
Do you have plenty of memory of the computer?
- .-. ..- … – / – …. . / .—. .-. - .. . … …
by Jeff Zimmerman on Feb 17, 2010 12:36 PM EST up reply actions
Yeah
Memory shouldn’t be an issue. Not sure whats up, I had no problem following the first saberizing a mac series.
I've never used Jeff's SQL dump, I built it up from the XML files
Can you try clearing your database, redownloading the Zip file, and then import it and tell me what happens?
by vivaelpujols on Feb 17, 2010 2:39 PM EST up reply actions
Wasn't there a problem with the gui timing out on Macs.
It might have to be installed via the command line like retrosheet.
- .-. ..- … – / – …. . / .—. .-. - .. . … …
by Jeff Zimmerman on Feb 17, 2010 2:57 PM EST up reply actions
Same result
I’ve tried multiple times to re-download and import again all with the same result. Has anyone else with a mac had this problem or is it just my computer?
How can I install it via the command line?
command line command
mysql < pbp2.sql
- .-. ..- … – / – …. . / .—. .-. - .. . … …
by Jeff Zimmerman on Feb 17, 2010 7:28 PM EST up reply actions

by 





























