Some odd results from a multivariate regression
After I wrote this post on my blog about FIP and its correlation to UZR, I got the idea to run a multivariate regression with each position as its own variable (excluding C and P, since there's no UZR data for them). I got some interesting results, but I'm not sure whether (A) I can trust the data or (B) I'm interpreting it correctly, so I thought I'd post it here.
Dependent Variable: TotalRunDiff (or TRD) = (IP/9) * (FIP - ERA)
This is the difference in earned runs projected by FIP and actual earned runs.
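As a quick illustration of the definition above, here is a minimal Python sketch of the TRD calculation. The function name and the example team (1440 IP, 4.30 FIP, 4.00 ERA) are made up for illustration and are not from the 2008 data.

```python
def total_run_diff(innings_pitched: float, fip: float, era: float) -> float:
    """TRD = (IP / 9) * (FIP - ERA), as defined in the post."""
    return (innings_pitched / 9.0) * (fip - era)


# Hypothetical team, not real data: 1440 IP, 4.30 FIP, 4.00 ERA.
example_trd = total_run_diff(1440.0, 4.30, 4.00)
print(round(example_trd, 1))  # 48.0
```

A positive TRD means the team allowed fewer earned runs than FIP projected, which is the gap the defensive numbers are being asked to explain.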
Independent variables: 1Buzr, 2Buzr, 3Buzr, SSuzr, LFuzr, CFuzr, RFuzr
The UZR for each team by position.
I input data for all 30 teams in 2008. Here's the equation the regression analysis spit out:
TRD = .048 + 2.12*1Buzr + (-.10)*2Buzr + 1.60*3Buzr + .70*SSuzr + .02*LFuzr + 1.66*CFuzr + .60*RFuzr
The correlation was pretty strong; r = .8063.
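To make the equation concrete, here is a small Python sketch that plugs positional UZR values into the fitted coefficients quoted above. The dictionary layout, the function name, and the two example teams are assumptions for illustration only.

```python
# Coefficients and intercept copied from the fitted equation in the post.
INTERCEPT = 0.048
COEFFICIENTS = {
    "1B": 2.12, "2B": -0.10, "3B": 1.60, "SS": 0.70,
    "LF": 0.02, "CF": 1.66, "RF": 0.60,
}


def predicted_trd(team_uzr: dict) -> float:
    """Evaluate the fitted equation for one team's UZR by position."""
    return INTERCEPT + sum(COEFFICIENTS[pos] * team_uzr[pos] for pos in COEFFICIENTS)


# Hypothetical teams: league average everywhere, then +5 UZR at 1B only.
average_team = {pos: 0.0 for pos in COEFFICIENTS}
good_first_base = {**average_team, "1B": 5.0}

print(round(predicted_trd(average_team), 3))     # 0.048
print(round(predicted_trd(good_first_base), 3))  # 10.648
```

Taken at face value, the equation credits about 2.12 runs of TRD per run of 1B UZR, which is part of why the results look odd.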
This seems to imply that the most important positions, in order, are 1B, CF, 3B, SS, RF, LF, and 2B, with good defense at 2B actually having a slightly negative effect on a team (which doesn't make any sense, but this is why I will run more regressions on other seasons besides 2008).
Just wondering if anybody had any input on this.
Comments
The second base data could have been thrown off...
If the teams with good 2B UZR in 2008 all happened to have bad TRD, it could adversely affect your data. Of course, this would mean that most of the bad TRD teams had good 2B UZR, and that the good TRD teams had bad 2B UZR. I don’t think it accurately reflects 2B’s impact on TRD.
A larger sample size could fix this problem.
by NoNameOnCard on Mar 4, 2009 2:54 PM EST reply actions 0 recs
Where did you get your positional team data?
The Fangraph leaderboards don’t divide up production between teams for a given player. Did you go team by team?
Beyond the Boxscore // Calling BJ Upton lazy is lazy.
by Sky Kalkman on Mar 4, 2009 2:59 PM EST reply actions 0 recs
I went to fangraphs => teams => fielders => position
I guess that might mess things up a little bit, but I don’t think it would be that large of an issue, would it?
---
Juuuust a bit outside!!
http://www.rightfieldbleachers.com
by Jack Moore on Mar 4, 2009 3:03 PM EST up reply actions 0 recs
So you downloaded seven sets of data?
Yeah, that should work just fine.
Multiple seasons would be good, obviously.
Beyond the Boxscore // Calling BJ Upton lazy is lazy.
by Sky Kalkman on Mar 4, 2009 3:27 PM EST up reply actions 0 recs
One thing to look at is the Standard Deviations between datasets (2B UZR, etc)
For example, if the 2nd base numbers are all near 0, it could mean that all teams are getting about the same play from their 2nd basemen, so its value doesn’t really matter.
I am wondering how well the positional S.D. correlates with the positional multiplier in your equation.
by Jeff Zimmerman (TucsonRoyal) on Mar 4, 2009 3:34 PM EST reply actions 0 recs
SDs by position:
1B   | 2B   | 3B  | SS   | LF    | CF    | RF
5.68 | 8.48 | 9.8 | 9.83 | 11.22 | 10.61 | 15.63
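One rough way to put the SDs and the multipliers on the same footing is to multiply each coefficient by its position's SD, which gives the TRD change associated with a one-SD change in that position's UZR. The Python sketch below does that with the numbers from this thread; the variable names are made up, and this is only a scaling exercise, not a test of whether the coefficients are trustworthy.

```python
# Coefficients from the fitted equation and SDs from the table above.
coefficients = {"1B": 2.12, "2B": -0.10, "3B": 1.60, "SS": 0.70,
                "LF": 0.02, "CF": 1.66, "RF": 0.60}
std_devs = {"1B": 5.68, "2B": 8.48, "3B": 9.8, "SS": 9.83,
            "LF": 11.22, "CF": 10.61, "RF": 15.63}

# TRD change per one-SD change in UZR at each position.
per_sd_effect = {pos: coefficients[pos] * std_devs[pos] for pos in coefficients}

# Rank positions by the size of that effect, largest first.
ranking = sorted(per_sd_effect, key=lambda pos: abs(per_sd_effect[pos]), reverse=True)

for pos in ranking:
    print(f"{pos}: {per_sd_effect[pos]:6.2f}")
print(ranking)  # ['CF', '3B', '1B', 'RF', 'SS', '2B', 'LF']
```

On this scale CF and 3B move ahead of 1B, because 1B has the smallest spread of any position, while 2B and LF stay near zero.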
---
Juuuust a bit outside!!
http://www.rightfieldbleachers.com
by Jack Moore on Mar 4, 2009 6:03 PM EST reply actions 0 recs
Just nothing there
I was also thinking the left side might be more important since most people are right-handed, but that only holds for the infield, not the outfield.
I also looked at chances and that doesn’t help explain 2nd
“first and third baseman get around 1.5 chances per game, CF, 2B and SS, 2.5, and RF and LF, 2.0.” -MGL
If you remove 2nd base from the regression, what happens to r-squared?
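For anyone who wants to try that comparison, here is a minimal NumPy sketch of fitting the regression with and without the 2B column and comparing r-squared. The data here is randomly generated as a stand-in (30 rows, 7 columns), not the 2008 numbers, and the helper name `r_squared` is made up.

```python
import numpy as np


def r_squared(X: np.ndarray, y: np.ndarray) -> float:
    """R^2 of an ordinary least squares fit of y on X plus an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot


positions = ["1B", "2B", "3B", "SS", "LF", "CF", "RF"]

# Made-up stand-in data: 30 "teams", 7 positional UZR columns, noisy TRD.
rng = np.random.default_rng(0)
X = rng.normal(0.0, 10.0, size=(30, 7))
y = X @ rng.normal(1.0, 0.5, size=7) + rng.normal(0.0, 15.0, size=30)

full = r_squared(X, y)
without_2b = r_squared(np.delete(X, positions.index("2B"), axis=1), y)
print(f"full: {full:.3f}, without 2B: {without_2b:.3f}")
```

Dropping a variable can never raise r-squared in a least squares fit, so the interesting question is how small the drop is. A tiny drop would suggest 2B is contributing little to the fit.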
by Jeff Zimmerman (TucsonRoyal) on Mar 4, 2009 6:41 PM EST up reply actions 0 recs
Sometimes multiple regression is simply wrong.
It’s a very crude tool.
Some suggestions, however:
- Use multiple years of data.
- Look at all runs, not just earned runs.
- Consider removing the constant.
by cwyers on Mar 4, 2009 8:43 PM EST reply actions 0 recs
p-values
You might consider double-checking the p-values of each individual term to see if any (i.e. 2B) could be considered insignificant contributors to the dependent variable. Just a thought…
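A minimal NumPy sketch of that check is below. It computes the t-statistic (coefficient divided by its standard error) for each term rather than the p-value itself. With 30 teams and 8 fitted terms there are 22 degrees of freedom, where |t| above roughly 2.07 corresponds to p < .05 two-sided. The data is randomly generated stand-in data, and the function name is made up.

```python
import numpy as np


def ols_t_statistics(X: np.ndarray, y: np.ndarray):
    """Return (coefficients, standard errors, t-statistics, residual dof).

    The first entry of each array is the intercept.
    """
    n_obs, n_regressors = X.shape
    design = np.column_stack([np.ones(n_obs), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    dof = n_obs - (n_regressors + 1)
    sigma2 = float(residuals @ residuals) / dof       # residual variance
    cov = sigma2 * np.linalg.inv(design.T @ design)   # coefficient covariance
    std_err = np.sqrt(np.diag(cov))
    return beta, std_err, beta / std_err, dof


# Made-up stand-in data: 30 "teams", 7 positional UZR columns, noisy TRD.
rng = np.random.default_rng(1)
X = rng.normal(0.0, 10.0, size=(30, 7))
y = X @ rng.normal(1.0, 0.5, size=7) + rng.normal(0.0, 15.0, size=30)

beta, std_err, t_stats, dof = ols_t_statistics(X, y)
labels = ["const", "1B", "2B", "3B", "SS", "LF", "CF", "RF"]
for label, b, t in zip(labels, beta, t_stats):
    print(f"{label:>5}: coef = {b:7.3f}, t = {t:6.2f}")
print("residual degrees of freedom:", dof)
```

Most regression tools report these t-statistics and p-values directly, so this is only meant to show what the numbers are.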
by jrfischer on Mar 5, 2009 11:53 AM EST reply actions 0 recs
wait
did you run a regression on 7 independent variables using 30 observations?
by Matt Swartz on Mar 5, 2009 9:00 PM EST reply actions 0 recs
hello?
just to clarify, running a regression with seven independent variables on only thirty observations is useless. if that’s what you did, it’s not even worth analyzing this. you might as well just summarize the individual players. for seven regressors, you should have 150-250 observations to be safe, i’d say. nothing much short of that.
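One way to see the small-sample problem in numbers is adjusted r-squared, which penalizes the fit for the number of regressors. The sketch below assumes the reported r = .8063 is the multiple correlation coefficient of the full regression, which the post does not state explicitly. The function name is made up.

```python
def adjusted_r_squared(r: float, n_obs: int, n_regressors: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    r2 = r * r
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_regressors - 1)


# Assumption: r = .8063 is the multiple R from the 30-team, 7-variable fit.
r2 = 0.8063 ** 2
adj_30 = adjusted_r_squared(0.8063, 30, 7)
adj_210 = adjusted_r_squared(0.8063, 210, 7)  # same r with seven seasons of data

print(f"R^2 = {r2:.3f}")                        # R^2 = 0.650
print(f"adjusted, 30 obs  = {adj_30:.3f}")      # adjusted, 30 obs  = 0.539
print(f"adjusted, 210 obs = {adj_210:.3f}")
```

With only 30 observations the adjustment knocks the fit down from about .65 to about .54. The same r over several seasons would lose far less.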
by Matt Swartz on Mar 7, 2009 10:13 AM EST up reply actions 0 recs
So, 5-8 seasons' worth?
UZR’s available for seven at Fangraphs, right?
Beyond the Boxscore // Calling BJ Upton lazy is lazy.
by Sky Kalkman on Mar 7, 2009 10:32 AM EST up reply actions 0 recs
OK, when I get a chance I’ll add the other seasons. Might not be for a bit as I have a packed week coming up.
---
Juuuust a bit outside!!
http://www.rightfieldbleachers.com
by Jack Moore on Mar 7, 2009 1:31 PM EST up reply actions 0 recs
Further proof that Pujols is the most valuable player in the game
vivaelbeñsheets
by vivaelpujols on Mar 6, 2009 1:09 AM EST reply actions 0 recs
datum
It looks to me more like a measure of the variability in quality of the defender between teams at that position – rather than importance of the position.
A 1B can be a slow-footed non-athlete or extremely athletic, whereas 2B are remarkably similar: athletic, good glove, average number of plays handled.
Go away! Guys, you're gonna wake up my Mom!
by David Howards Legacy on Mar 6, 2009 4:15 PM EST reply actions 0 recs
Actually
Looking at the spread of UZR talent by position, OF is by far the most varying.
vivaelbeñsheets
by vivaelpujols on Mar 9, 2009 10:27 PM EDT up reply actions 0 recs















