Geographic Bias in the Amateur Draft: Part 2.1

Alex Smith and I previously published work at The Hardball Times showing evidence of a geographic bias in the amateur player draft. I now revisit that research to address some potential issues with our original methodology. Specifically, I will use regression techniques to model draft success, adding extra parameters to test our initial findings.

This article is an extension of prior work on geographic bias at The Hardball Times and is based upon research I presented at the annual SABR Analytics conference. This is part one of a two part series.

Introduction

In December of last year, Alex Smith and I published a study on geographic bias in the amateur player draft at The Hardball Times. We found evidence that players from known amateur baseball powerhouses such as California were consistently outperforming their peers from northern states and reaching the majors at a higher rate. In examining possible causes for this geographic imbalance we identified three main factors:

1. Behavioral reasons: Teams could be unwilling to invest too heavily in one region.

2. Geographic/Cultural reasons: Players in power states often play baseball year round and receive higher quality coaching and face higher quality competition. As a result they have a more advanced approach and are less raw entering professional ball.

3. Structural reasons: National showcase events that select a fixed number of amateurs from each region implicitly bias against players from states with deeper talent pools.

The research was a proof of concept, but our methodology had issues that I seek to address here. In this piece I will briefly review our prior work and discuss the reasons for extending the research. I have added two primary sections. The first, covered here, is explicit modeling of draftee success. The second, which I will cover next week, is the spatial analysis component of my research. If you are familiar with the prior research or are just feeling antsy, feel free to skip to the section titled "Methods," and if you are especially impatient, feel free to proceed straight to the "Results" section.

Prior Research and Motivation For Extension

To briefly summarize, we split players into six groups based on hitter/pitcher and high school/college/junior college. Our primary methodology was to compare the share of drafted players coming from a particular group and state to the share of major leaguers coming from that same group and state. For example, California high school position players represented 23.2% of all high school position players drafted in our sample. However, when we look at all high school position players in our sample who made the majors, Californians represent 32.1% of them. We interpret this large negative difference (drafted share minus majors share) to mean Californian high school position players are undervalued. If geography did not matter at all, in large samples we would expect the difference between these two shares to be close to zero. We found baseball powerhouses like California, Texas, and Florida, and to a lesser extent Arizona and Georgia, to be the most consistently undervalued states.
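The share comparison above is simple arithmetic. Here is a minimal sketch in Python using the California high school position player figures quoted above; the function name `share_difference` is just an illustrative choice, not something from the original study.

```python
def share_difference(drafted_share: float, majors_share: float) -> float:
    """Drafted share minus majors share, in percentage points.

    A negative value means the group supplies a larger share of major
    leaguers than of draftees, which we read as the group being undervalued.
    """
    return drafted_share - majors_share


# California high school position players, from the figures quoted above.
ca_hs_bats = share_difference(drafted_share=23.2, majors_share=32.1)
print(f"{ca_hs_bats:+.1f} percentage points")
```

A value near zero would indicate that a group reaches the majors roughly in proportion to how often it is drafted.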

Our initial study primarily served to show that there is variation in player success rates from state to state, but it could be improved upon. Our methodology fell short in four primary areas.

1. Lack of explicit control for draft position: Not controlling for draft position was the most important issue with the initial study. Though we reported average draft position for each group, a simple mean is not an adequate measure in this context. To stick with our example of California, if we have ten players drafted in the first round and ten in the fifteenth, we would expect that group as a whole to perform better than a group of twenty players drafted in the seventh round. In this study we will explicitly control for this.

2. Inclusion of late rounds (up to round 15): Similarly, including later rounds could skew our results. In the first five rounds, teams are carefully selecting guys they have scouted multiple times. By the mid-teens, teams may be drafting to satisfy a particular area scout or to pay a favor to the GM’s nephew. This time around we will limit our sample to just the first five rounds.

3. Rudimentary measure of over/under valued: Our measure to determine over/undervalued, the share difference, was a good measure for determining the direction of the bias and potentially the magnitude as well. However, the metric lacked a meaningful interpretation. By modeling draft success we will develop a result with a useful interpretation.

4. Simple use of spatial analysis: Our first attempt at spatial analysis simply looked at the locations of area scouts versus draftees. In this study we will use more advanced spatial techniques to determine if teams are scouting optimally.
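To see why a simple mean of draft position falls short (the first issue above), here is a rough sketch in Python. It assumes expected value falls off as a power of the overall pick number. The exponent of -0.58 and the pick numbers standing in for each round are illustrative assumptions, not figures from the original study.

```python
def expected_value(pick: int, exponent: float = -0.58) -> float:
    """Relative expected value of a pick, assuming a power-law decline."""
    return pick ** exponent


def mean(values):
    return sum(values) / len(values)


# Hypothetical overall pick numbers standing in for rounds 1, 7, and 15.
ROUND_1_PICK, ROUND_7_PICK, ROUND_15_PICK = 15, 195, 435

# Group A: ten first rounders plus ten fifteenth rounders.
split_group = [ROUND_1_PICK] * 10 + [ROUND_15_PICK] * 10
# Group B: twenty seventh rounders.
middle_group = [ROUND_7_PICK] * 20

split_value = mean([expected_value(p) for p in split_group])
middle_value = mean([expected_value(p) for p in middle_group])

print("mean pick, split group: ", mean(split_group))
print("mean pick, middle group:", mean(middle_group))
print("mean value, split group: ", round(split_value, 3))
print("mean value, middle group:", round(middle_value, 3))
```

Under these assumptions the split group has the later average pick yet the higher average expected value, because value drops steeply at the top of the draft and flattens out later. That is the reason an average draft position cannot stand in for an explicit control.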

To address these issues we will construct two models to measure draft success. We will use the framework set forth by Sky Andrecheck at Baseball Analysts and create a model to predict probability of reaching the majors as well as expected career WAR. Next week we will use more advanced spatial analysis to examine the distribution of area scouts and top draftees throughout the country.

Methods

Before getting into the specifics of our two models I will lay out some definitions and assumptions. First, we will define the five states mentioned above (California, Texas, Florida, Georgia, and Arizona) as power states. Our data set covers the top 150 picks (the first five rounds, more or less) for the 10-year span from 1997 through 2006. This time around we will include junior college draftees in the college group, which simply allows us to stratify our data into fewer groups. Finally, players drafted twice within the sample have been included twice. This means that if a high school player was drafted in the top five rounds but didn’t sign, and was later drafted out of college in the top five rounds, he will be included twice. Excluding duplicate players did not change the results very much, and reasonable people could disagree on what to do here, but I decided to keep them so that we weren’t introducing an extra bias in the results.

In order to predict career WAR we must allow enough time for players to play out their entire careers. However, we don’t have that much time, since we want to be able to use more recent data. We will again follow the framework laid out by Sky Andrecheck and scale up players’ "career" WAR. We will assume players drafted from 1997-2001 have accumulated all of their career WAR. We will then scale up the rest of the players’ WAR numbers according to the following assumptions, by draft year:

2002: have accumulated 90% of their career WAR

2003: have accumulated 80% of their career WAR

2004: have accumulated 70% of their career WAR

2005: have accumulated 60% of their career WAR

2006: have accumulated 50% of their career WAR

This tradeoff lets us examine more recent years, but we can still only go up to 2006 before we’re too close to the present.
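The adjustment above can be sketched in Python as follows. The function and variable names are my own for illustration, and the sample WAR value is made up; only the percentages come from the list above.

```python
# Assumed share of career WAR already accumulated, by draft year.
# Draft classes from 1997 through 2001 are treated as complete (share 1.0).
ACCUMULATED_SHARE = {2002: 0.9, 2003: 0.8, 2004: 0.7, 2005: 0.6, 2006: 0.5}


def projected_career_war(war_to_date: float, draft_year: int) -> float:
    """Scale WAR to date up to an estimated career total."""
    if not 1997 <= draft_year <= 2006:
        raise ValueError("draft year outside the 1997-2006 sample")
    share = ACCUMULATED_SHARE.get(draft_year, 1.0)
    return war_to_date / share


# A hypothetical player with 6.0 WAR so far, under two different draft years.
print(projected_career_war(6.0, 2006))  # half of career assumed complete
print(projected_career_war(6.0, 1999))  # career assumed complete
```

Dividing by the accumulated share means a 2006 draftee's total is doubled, while a 1999 draftee's total is left alone.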

We will partition the players in our data set into eight classes based on hitter/pitcher, high school/college, and power state/regular state. We can look at the share of the draft each of these eight groups represents over time:

We can see from these charts that this partitions the data into roughly equal segments. We can also see that there does not appear to be any significant shift in strategy in terms of one class of player being drafted significantly more or less over time.

Our two models will both take in virtually the same set of predictors: draft position and then a series of indicators for our eight classes of player. College hitters from power states are our reference group.

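$$
\begin{aligned}
\text{Result} = \Big( a &+ \beta_1 \, I(\text{College Bat, Regular State}) \\
&+ \beta_2 \, I(\text{HS Bat, Power State}) \\
&+ \beta_3 \, I(\text{HS Bat, Regular State}) \\
&+ \beta_4 \, I(\text{College Pitcher, Power State}) \\
&+ \beta_5 \, I(\text{College Pitcher, Regular State}) \\
&+ \beta_6 \, I(\text{HS Pitcher, Power State}) \\
&+ \beta_7 \, I(\text{HS Pitcher, Regular State}) \Big) \times \text{Pick}^{b}
\end{aligned}
$$

Here each I(·) equals 1 when the player belongs to that class and 0 otherwise, and each β is the coefficient estimated for that class.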

The difference between the two is that our WAR model estimates the power coefficient b to which draft pick is raised, while our logistic regression simply uses a transformed version of draft pick built from the coefficient estimated in the WAR model. This makes our models very similar to Sky’s original models, the differences being that we include parameters for geography and that we interact each indicator with draft pick. Adding these parameters increases the adjusted R-squared of the WAR expectancy model from .136 to .165, a gain of about 3 percentage points, or more than 20%. This tells us we are explaining a non-negligible additional share of the variation in expected career value.
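To make the functional form concrete, here is a minimal Python sketch. It is not the fitted regression: the fitted coefficients are not reported in this article, so the intercept, the class offsets, and the exponent are backed out from the rounded eWAR values in the results tables in the next section. All names are illustrative.

```python
import math

# Pick-1 and pick-100 eWAR for the reference group (college bats from
# power states), read from the results tables rather than a fitted model.
REF_EWAR_PICK_1 = 56.2
REF_EWAR_PICK_100 = 3.9

# At pick 1, Pick^b equals 1, so the intercept a is the pick-1 value.
a = REF_EWAR_PICK_1
# Solve REF_EWAR_PICK_100 = a * 100^b for the exponent b.
b = math.log(REF_EWAR_PICK_100 / a) / math.log(100)

# Class offsets relative to the reference group, again backed out from
# pick-1 eWAR values in the tables (only a few classes shown here).
offsets = {
    "College Bat, Power State": 0.0,
    "HS Pitcher, Power State": 33.9 - a,
    "HS Bat, Regular State": 29.8 - a,
}


def expected_war(pick: int, player_class: str) -> float:
    """(a + class offset) * pick^b, the functional form described above."""
    return (a + offsets[player_class]) * pick ** b


print(f"b = {b:.2f}")
print(f"{expected_war(10, 'College Bat, Power State'):.1f}")
print(f"{expected_war(30, 'HS Pitcher, Power State'):.1f}")
```

Because every class shares the same exponent, the ratio of expected WAR between any two classes is the same at every pick, which is why the WAR advantage columns in the comparison tables are constant down the draft.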

Results

Finally, on to the results! Let us begin with the pitchers. Using our two models we predict the probability of reaching the majors as well as the expected WAR for each of our four classes of pitcher.

Pick	Type	Majors
1	College Pitcher, Power State	99.9%
1	College Pitcher, Regular State	99.9%
1	HS Pitcher, Power State	70.6%
1	HS Pitcher, Regular State	97.0%

10	College Pitcher, Power State	77.9%
10	College Pitcher, Regular State	78.1%
10	HS Pitcher, Power State	52.4%
10	HS Pitcher, Regular State	58.7%

30	College Pitcher, Power State	57.5%
30	College Pitcher, Regular State	54.1%
30	HS Pitcher, Power State	48.9%
30	HS Pitcher, Regular State	45.1%

100	College Pitcher, Power State	43.7%
100	College Pitcher, Regular State	38.2%
100	HS Pitcher, Power State	47.0%
100	HS Pitcher, Regular State	37.4%

Pick	Type	eWAR
1	College Pitcher, Power State	14.4
1	College Pitcher, Regular State	20.7
1	HS Pitcher, Power State	33.9
1	HS Pitcher, Regular State	14.9

10	College Pitcher, Power State	3.8
10	College Pitcher, Regular State	5.4
10	HS Pitcher, Power State	8.9
10	HS Pitcher, Regular State	3.9

30	College Pitcher, Power State	2.0
30	College Pitcher, Regular State	2.9
30	HS Pitcher, Power State	4.7
30	HS Pitcher, Regular State	2.1

100	College Pitcher, Power State	1.0
100	College Pitcher, Regular State	1.4
100	HS Pitcher, Power State	2.3
100	HS Pitcher, Regular State	1.0

What first stands out in both charts is the oddity of high school pitchers from power states. While the other three groups of pitchers start with nearly a 100% chance of reaching the majors at the first overall pick, for high school pitchers from power states the probability is down to 70.6%. However, as we move deeper into the draft, the probability levels off around 50%. This result could warrant further study, but I hypothesize that this lower rate of reaching the majors has to do with the high attrition rate for high school pitchers from power states. Due to the higher prevalence of year-round pitchers in these states, they are simply more likely to blow out their arms before ever sniffing the major leagues.

However, the intense focus on baseball in power states pays off for some of these high school pitchers, leading to pitchers with more refined skills and raw ability. If they can stay healthy, their advanced approach and ability can get them to the majors, hence the leveling off.

In the WAR expectancy chart we see high school pitchers from power states dominate all other classes of pitcher. In terms of expected WAR they have better than a 100% advantage over high school pitchers from regular states (33.9 vs. 14.9 at the first pick) and roughly a 60% advantage over college pitchers from regular states (33.9 vs. 20.7).

On the college side of things we observe that power state college arms have a slightly higher propensity for making the majors, which becomes apparent midway through the first round. However, in terms of expected WAR they are significantly below college arms from regular states and are close to on par with high school pitchers from regular states.

I am less concerned with this result, as carving up the country into power states and regular states makes less sense for college amateurs than for high school amateurs. This is because high school location is more indicative of a pitcher’s characteristics than college location. College location, while an important factor, is less important than, say, college conference or division.

We can compare the relative advantages of each group here to arrive at a few conclusions. The tables below list the same information as above, but also include columns for relative advantage in the expected chance of reaching the majors and in expected WAR. The final column compounds these two relative advantages. While not mathematically proper, this gives us an idea of how competing advantages interact and which group is more efficient overall.

Pick	Majors: Power State	Majors: Regular State	Majors: Regular Advantage	Expected WAR: Power State	Expected WAR: Regular State	Expected WAR: Power Advantage	Compound Ratio
1	70.6%	97.0%	1.37	33.9	14.9	2.27	1.65
2	63.4%	89.9%	1.42	22.7	10.0	2.27	1.60
10	52.4%	58.7%	1.12	8.9	3.9	2.27	2.02
30	48.9%	45.1%	0.92	4.7	2.1	2.27	2.46
100	47.0%	37.4%	0.80	2.3	1.0	2.27	2.85

We can see from the table that the advantage power state high school pitchers have in expected WAR dominates the advantage regular state high school pitchers have in making the majors over the first dozen or so picks. In the later rounds the power state pitchers have the advantage in reaching the majors as well, creating a compounded advantage well over two, the highest of all groups.
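For readers who want to reproduce the advantage columns, here is a small Python sketch using the first overall pick for high school pitchers. The way the two advantages are compounded (dividing the power state WAR advantage by the regular state majors advantage) is inferred from the table values rather than stated explicitly, so treat it as an assumption. Since the inputs here are already rounded, results can differ from the table in the last digit.

```python
def relative_advantage(favored: float, other: float) -> float:
    """Ratio of the favored group's value to the other group's value."""
    return favored / other


# High school pitchers at the first overall pick, from the tables above.
majors_power, majors_regular = 0.706, 0.970
ewar_power, ewar_regular = 33.9, 14.9

# Regular states lead in reaching the majors; power states lead in eWAR.
regular_majors_adv = relative_advantage(majors_regular, majors_power)
power_ewar_adv = relative_advantage(ewar_power, ewar_regular)

# Net the two competing advantages: values above 1 favor power states.
compound = power_ewar_adv / regular_majors_adv

print(f"regular advantage (majors): {regular_majors_adv:.2f}")
print(f"power advantage (eWAR):     {power_ewar_adv:.2f}")
print(f"compound ratio:             {compound:.2f}")
```

When both advantages point the same way, as they do for this group at picks 30 and 100, the majors ratio drops below one and the compound ratio ends up larger than the WAR advantage alone.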

Pick	Majors: Power State	Majors: Regular State	Majors: Power Advantage	Expected WAR: Power State	Expected WAR: Regular State	Expected WAR: Regular Advantage	Compound Ratio
1	99.9%	99.9%	1.00	14.4	20.7	1.43	1.43
2	98.8%	99.3%	1.00	9.7	13.8	1.43	1.43
10	77.9%	78.1%	1.00	3.8	5.4	1.43	1.43
30	57.5%	54.1%	1.06	2.0	2.9	1.43	1.35
100	43.7%	38.2%	1.14	1.0	1.4	1.43	1.25

For college pitchers the advantage in terms of reaching the majors lies with the power state pitchers. The relative advantage is small, however, and is countered by the relative advantage regular state college pitchers have in terms of expected WAR. Overall, college pitchers from regular states have a relative advantage over college pitchers from power states. However, as I mentioned earlier, I would not put as much stock in this result, as we should really be looking at conference and division for college players rather than the state in which the school is located.

Now on to position players!

Pick	Type	Majors
1	College Bat, Power State	100.0%
1	College Bat, Regular State	100.0%
1	HS Bat, Power State	99.0%
1	HS Bat, Regular State	99.3%

10	College Bat, Power State	98.7%
10	College Bat, Regular State	95.9%
10	HS Bat, Power State	62.9%
10	HS Bat, Regular State	60.8%

30	College Bat, Power State	83.2%
30	College Bat, Regular State	72.5%
30	HS Bat, Power State	45.3%
30	HS Bat, Regular State	41.2%

100	College Bat, Power State	50.2%
100	College Bat, Regular State	42.7%
100	HS Bat, Power State	35.3%
100	HS Bat, Regular State	30.7%

Pick	Type	eWAR
1	College Bat, Power State	56.2
1	College Bat, Regular State	39.7
1	HS Bat, Power State	21.1
1	HS Bat, Regular State	29.8

10	College Bat, Power State	14.8
10	College Bat, Regular State	10.5
10	HS Bat, Power State	5.6
10	HS Bat, Regular State	7.9

30	College Bat, Power State	7.8
30	College Bat, Regular State	5.5
30	HS Bat, Power State	2.9
30	HS Bat, Regular State	4.2

100	College Bat, Power State	3.9
100	College Bat, Regular State	2.8
100	HS Bat, Power State	1.5
100	HS Bat, Regular State	2.1

First, let us focus our attention on the probability of reaching the majors. This chart takes quite a different shape overall. We observe a clear separation between college and high school bats, with the advantage going to the college players. These results agree with previous studies that claim college bats are the safest draft choice. It is of note that college bats see a much more gradual drop-off through the draft, whereas pitchers and high school hitters had a much steeper decline in probability of reaching the majors before leveling off.

Further, in terms of reaching the majors, we see an advantage for power states in both the college and high school groupings. On the college side we can again take this advantage with a grain of salt. More interesting is the result that high school position players from power states have an advantage in terms of reaching the majors, but only a slight one. In the initial Hardball Times study, high school position players from power states appeared to be the most undervalued grouping. Now it appears that much of that separation was indeed explained away by controlling for draft pick.

Examining the WAR expectancy results, we again see that college has an advantage over high school. However, we get a very interesting result: high school bats from regular states have an advantage over high school bats from power states. Because of increased playing time and higher quality coaching and competition in power states, we would expect this advantage to be flipped. It is very hard to explain why high school bats from regular states would have an advantage over high school bats from power states.

I couldn’t let this result go. The scout I spoke with was equally puzzled. He first reasoned that if you drafted ten guys from Florida, all of them could go on to be MLB regulars, but if you draft ten guys from New Jersey nine will never make the majors but one could be Mike Trout. This would make sense if we assumed that players in the power states had more ability and players from regular states had more raw athleticism. Thus if just one in however many of the raw athletes in the regular states puts it together and becomes an MVP caliber player, then that could explain the higher expected WAR.

This explanation is tidy and is to some degree satisfying. It would be nice if there were a simple tradeoff where the regular states yielded higher upside guys and the power state guys were better bets. However, this doesn’t stand up to the scrutiny we are going to give it. My issue with this explanation is this question: "Why aren’t there high upside raw athletes living in the power states as well?" My first possible explanation was that the stronger athletes who lack present ability are going unnoticed in the power states because their peers with higher present talent are overshadowing them. The scout I spoke with shot this down pretty quickly. He was confident he knew all the best athletes in his region, even if they were committed to college to play other sports. He did, however, admit that in regard to high school position players "there may be a bias towards present ability."

Again we can look at a table of the relative advantages.

Pick	Majors: Power State	Majors: Regular State	Majors: Power Advantage	Expected WAR: Power State	Expected WAR: Regular State	Expected WAR: Power Advantage	Compound Ratio
1	100.0%	100.0%	1.00	56.2	39.7	1.41	1.41
2	100.0%	100.0%	1.00	37.6	26.6	1.41	1.41
10	98.7%	95.9%	1.03	14.8	10.5	1.41	1.46
30	83.2%	72.5%	1.15	7.8	5.5	1.41	1.62
100	50.2%	42.7%	1.18	3.9	2.8	1.41	1.66

For college hitters it is relatively straightforward. The power states have the relative advantage in both reaching the majors and in expected WAR.

Pick	Majors: Power State	Majors: Regular State	Majors: Power Advantage	Expected WAR: Power State	Expected WAR: Regular State	Expected WAR: Regular Advantage	Compound Ratio (Regular Advantage)
1	99.0%	99.3%	1.00	21.1	29.8	1.41	1.42
2	94.9%	95.6%	0.99	14.1	20.0	1.41	1.42
10	62.9%	60.8%	1.04	5.6	7.9	1.41	1.36
30	45.3%	41.2%	1.10	2.9	4.2	1.41	1.29
100	35.3%	30.7%	1.15	1.5	2.1	1.41	1.23

High school hitters from power states have the relative advantage in terms of making the majors, but a disadvantage in terms of expected WAR. When we compound these advantages, the result favors the regular state position players by a small to moderate amount.

Conclusions

These models provide evidence that even when controlling for draft position and limiting our sample to the first 150 picks a geographic bias persists. The models show differences in groups of players across states not only in their chance of reaching the majors, but also in their expected career value. To summarize:

1. College pitchers from regular states seem to have an advantage over college pitchers from power states. This is a puzzling result, but is most likely due to the fact that state geography does not matter as much as conference and division at the college level.

2. High school pitchers from power states have a clear advantage over high school pitchers from regular states. The culture of youth pitching in the power states leads to a much higher ceiling for power state high school pitching prospects. However, their increased intensity may also lead to more injuries before reaching the majors, which may be to blame for their lower rate of reaching the majors in the early first round.

3. College hitters from power states have an unambiguous advantage over college hitters from regular states, both in terms of probability of reaching the majors and expected career WAR.

4. High school hitters from regular states have an advantage over those from power states in terms of expected WAR that trumps the power state advantage in terms of reaching the majors. This result is also quite puzzling and could be a result of a bias towards present ability.

Next week I will present the second portion of my research, which concerns the distribution of area scouts and top draftees throughout the country. We will examine whether or not MLB teams optimally distribute their scouts through the country in a manner such that the areas with the highest concentration of players are covered adequately. In addition to the maps presented at SABR I will also include a couple of team level maps to examine not just MLB teams in the aggregate level, but also individually.

Draft data courtesy of Baseball-Reference.com

Dan Meyer is a junior at Colby College majoring in Economics and Mathematical Sciences. You can follow him on Twitter @dtrain_meyer.