Methods section — Cluster analysis of ballparks: An exercise in failure and futility

Can you use park factors and cluster analysis to group ballparks into finer categories than just hitter's and pitcher's parks? Turns out you can't, but I gave it a good long try.

When people talk about how different ballparks play, certain terms and concepts come up regularly and almost exclusively. Coors Field is a Hitter’s Park. So is Chase Field. Petco is a Pitcher’s Park. On rare occasion, it might get a little more specific — e.g., AT&T has Triples Alley — but in general the conversation doesn’t get much deeper. Just because it’s not done, however, doesn’t necessarily mean it can’t be done. Is it possible to use data, specifically park factors, to group major league ballparks into better-defined groups than just good-for-hitters and good-for-pitchers?

I set out to accomplish this through the statistical concept of clustering. Now, this seems like an ideal spot to mention that I am NOT a statistician, and so there’s a decent chance that some or all of what follows is a complete misuse of good techniques. That said, I think everything was used correctly, and hopefully my editors here will kill this piece off entirely if it makes no sense.

Additionally, I’ll mention that everything done herein was done either in MySQL (for data storage/retrieval), Excel (for easily assembling the needed data), or R (for actual data analysis). MySQL and R are both free programs, and the Excel stuff would have been just as easy with any of its free competitors.

Still reading? Good - here we go.

The first step in the process was deciding on a set of park factors to thoroughly characterize the different aspects of how MLB parks play. I wanted to capture run environment, power hitting, the likelihood of batted ball types (grounder, popup, etc.), true outcomes, and in-play hitting. This was complicated in part by the fact that publicly available information on batted ball types is… not ideal. Although Retrosheet has complete batted-ball-type records for all seasons since 2003, it relies on stringers to classify them, leading to inconsistencies in what counts as a pop-up, line drive, fly ball, etc. The records are perfectly reliable for on-ground versus in-air, though, so that's the split I used.

I ended up picking nine park factors, split by handedness (so, eighteen factors really): LinearWeightsRuns/G, LWTSRuns/G on ground balls, LWTSRuns/G on balls-in-air, Ground Ball/Ball-in-Air ratio, Home Runs per PA, Strikeouts per PA, Walks per PA, ISO, and BABIP. In each case, I calculated the park factors myself, using the procedure I’ve written about before at this very site and taking a five-year park average. I also calculated the linear weights myself, empirically and individually for each season (see the wiki at Tango's site for details). Based on advice from BtBS’s own Neil Weinberg, I wanted to include a baserunning measure, but couldn’t find one that seemed both worthwhile and calculable from the data I have access to. Since these measures include batted ball types, I only looked at data from the last twelve seasons.

In order to group ballparks with respect to these eighteen park factors, I (or rather, the clustering algorithms) needed to know the distance between each park’s factors. This isn’t as straightforward as it may seem. If each factor, or variable, were independent of all the others, it would be an easy calculation that you’ve seen a million times before — the square root of the sum of the squared differences between the points in each dimension. In two dimensions it’d be sqrt((a2-a1)^2 + (b2-b1)^2), which is easily enough expanded to the eighteen-dimensional space of my data. However, I can say confidently that these park factors are not independent of each other, but instead exhibit some covariance; that is, some of them vary in related ways. HRs per PA and ISO, for example, will tend to vary in the same direction. Because of this, I used an alternative distance metric called the Mahalanobis distance (calculated via the ecodist package in R). The Mahalanobis distance takes into account and corrects for this covariance in a way I’m not remotely qualified to describe, and normalizes based on each variable’s standard deviation across the data set.
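As a rough illustration, here's how that distance matrix might be computed with ecodist; `pf` and the CSV filename are hypothetical stand-ins for a data frame with one row per park-season and one column per park factor:

```r
library(ecodist)

# pf: hypothetical data frame, one row per park-season,
# eighteen columns of handedness-split park factors
pf <- read.csv("park_factors.csv", row.names = 1)

# Pairwise Mahalanobis distances; unlike plain Euclidean distance,
# this accounts for covariance among the factors (e.g., HR/PA and
# ISO tending to move together)
D <- distance(pf, method = "mahalanobis")
```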

So, armed with a data set of park factors and a matrix of Mahalanobis distances between them, the only thing left was to decide on a clustering algorithm, run it, and show all my cool results. Unfortunately, as you may have guessed from the subtitle of this article, it wasn’t quite that easy.

I narrowed the wide variety of clustering choices available in R to the two types that were easiest for me (and probably for you) to understand: hierarchical clustering and partition-based clustering. Both have strengths and weaknesses, are easy enough to explain in plain English, and completely failed at finding reliable structures in the data here.

Hierarchical clustering, specifically the bottom-up kind, starts with all data points being in their own group of one. It then merges the two closest groups and recalculates the distance between all groups (by one of several available methods). It continues until all points are put together into one giant group, and presents you with the distance between each pair of groups that was merged throughout the process. This is commonly presented as a dendrogram.
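In R, that whole procedure is only a couple of lines. This sketch uses the Mahalanobis distances from above; complete linkage is an assumed choice among the several available merge methods, since the text doesn't specify one:

```r
# Bottom-up (agglomerative) hierarchical clustering; hclust() is in
# base R, and "complete" is one of several available linkage methods
hc <- hclust(as.dist(D), method = "complete")

# Dendrogram, with branch labels suppressed for visual clarity
plot(hc, labels = FALSE, main = "Ballpark clustering", xlab = "", sub = "")
```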

[Figure: ballpark clustering dendrogram]

The above is the actual clustering dendrogram of the data I used here (note that I removed the data labels from the ends of all branches for visual clarity), and it doesn’t look promising. An ideal result would have featured a few significantly longer vertical lines, indicating a long distance between the connected groupings. Clusters are determined by horizontal slicing (i.e., at height = 275, a horizontal line crosses five vertical lines, each corresponding to a cluster containing all of the lower branches sprouting from it), and long distances/long vertical lines indicate good distinction between clusters. Nothing obvious pops out of the graph above; the best I can do is probably something in the 4-6 cluster range (i.e., a cut height between 250 and 300).
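That horizontal slicing is exactly what base R's cutree() does; you can cut either at a fixed height or by asking for a set number of groups:

```r
# Slice the dendrogram at a fixed height...
clusters_by_height <- cutree(hc, h = 275)  # five clusters, per the figure

# ...or ask directly for k groups
clusters_by_k <- cutree(hc, k = 5)
table(clusters_by_k)  # how many park-seasons fall in each cluster
```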

There are a few ways to assess the appropriateness of a clustering. Okay, for all I know there are a thousand ways, but I'm only going to focus on a single one here — silhouette plotting, which can be produced easily from any clustering function in R. A silhouette plot shows, for each data point in the set, a number called width, ranging from -1 to 1, that represents how well it fits in its assigned cluster: 1 is an ideal fit, -1 is an ideal fit in the nearest other cluster, and 0 indicates that it's on the border between two clusters. This number can be averaged across a cluster or even across the whole data set to indicate how good a grouping the clustering produced. The heuristic I found for silhouette width says that 0.75+ is a great fit, 0.5-0.75 is still pretty good, 0.25-0.5 is kinda bad, and under 0.25 means it's essentially worthless. Below, after a short code sketch, you can find the silhouette plots for clustering into 3-6 groups by the above method.
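A plot like the ones below comes from the cluster package; assuming the hierarchical clustering and distance matrix from earlier, it's roughly:

```r
library(cluster)

# Silhouette widths for a five-cluster cut of the dendrogram;
# silhouette() needs integer cluster assignments plus the distances
sil <- silhouette(cutree(hc, k = 5), as.dist(D))

plot(sil)               # per-point widths, grouped by cluster
summary(sil)$avg.width  # average width across the whole data set
```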

[Figure: silhouette plots for 3-6 clusters]

So, what you're seeing here is 1) poor clustering performance overall through all four numbers of clusters, but also 2) an improvement as the number of clusters increases. The 3-cluster chart has no groups above that 0.25 threshold, the 4-cluster has just one, the 5-cluster has the same one plus a second that starts to approach 0.25, and the 6-cluster has a third group approaching it. I ran the same test extended out to 15 clusters, and the performance continues to marginally improve, but the improvement has to be balanced with the descriptive power of the groups — it wouldn't do much good to create 30 clusters, for example, since that would likely just put each individual park in its own group. I had some hope, based on the pattern in the silhouette plots, that going to nine clusters would be a big jump in overall silhouette width, since there were three more S ~ 0 groups to be split up, but that wasn't the case. Regardless of the number of groups, the results indicate that clustering by this method is possible, but doesn't necessarily reveal much.

The other clustering method I attempted to use is a variation on k-means clustering called k-medoids (or partitioning around medoids). Simply speaking, the method chooses k initial points to serve as medoids, or cluster centers, and assigns each point in the set to the cluster whose medoid is nearest. It then checks to see if there’s a more central point within each group than the initial one; if there is, the process repeats with the new centers. It continues until the center points stop changing.

The biggest drawback of this method is that it requires you to tell it in advance how many clusters there should be. However, there are ways to address this issue. Much like I did for the hierarchical clustering, I iteratively tried a variety of cluster numbers to see which produced the best silhouette width. The algorithm was quick enough on my computer that I was able to run it for every cluster count from 2 to 100; a sketch of that sweep is below, and the average width for each count is shown in the figure that follows it:
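Using pam() (partitioning around medoids) from the cluster package, with the same Mahalanobis distance matrix assumed from earlier, the sweep might look like:

```r
library(cluster)

# k-medoids (PAM) for every cluster count from 2 to 100, keeping
# the average silhouette width each clustering produces
avg_width <- sapply(2:100, function(k) {
  pam(as.dist(D), k = k, diss = TRUE)$silinfo$avg.width
})

plot(2:100, avg_width,
     xlab = "Number of clusters", ylab = "Average silhouette width")
```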

[Figure: average silhouette width by number of clusters]

I also used one of the tools available in R to attempt to predict the ideal number of groups. The NbClust package provides a function that runs any of up to 30 tests that purport (and I use that word because I have no idea how they all work) to determine the correct cluster number, and then reports the recommendation from each test you choose to run. I ran the full complement of tests on my data, using k-means as the clustering algorithm of choice since k-medoids wasn’t available, and got 27 results. A plurality (10/27) told me to use two clusters, and second place was a tie, with 6/27 metrics each recommending three and thirty clusters. So, in other words, over 80% of the tests I ran told me my best options were effectively meaningless groupings. No other number appeared in the results more than once; the other five results were 5, 6, 16, 23, and 29.
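That run looks something like the sketch below. The scaling of the inputs and the index = "alllong" choice (the fullest battery of tests) are my assumptions; the per-test votes are then tabulated:

```r
library(NbClust)

# Run the full battery of cluster-count indices with k-means;
# scaling the park factors first is an assumption on my part
nb <- NbClust(data = scale(pf), distance = "euclidean",
              min.nc = 2, max.nc = 30,
              method = "kmeans", index = "alllong")

# First row of Best.nc holds each index's recommended cluster count;
# tabulating it gives the "votes" described above
table(nb$Best.nc[1, ])
```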


[Figure: histogram of recommended cluster counts]

So, neither hierarchical clustering nor partitioning was able to cluster ballparks effectively. This project was, all in all, a failure. Still, I'd like to have learned *something* from this whole process, and to share it, beyond simply how to go about doing it. For that, we'll turn to another (only slightly different) clustering method called fuzzy c-medoids.

Fuzzy c-medoids (FCM) partitions data in the same general way as k-medoids does, with two important changes. First, rather than assigning points to clusters, each point is given a likelihood of belonging to each cluster. Second, since there aren’t formal cluster assignments, the algorithm needs some measure of how "correct" each cluster is in order to know whether a medoid should move or change; FCM uses intra-cluster variance, and the user sets a threshold that the change in variance has to fall below in order to stop the algorithm.

Using the FANNY function in R, I found six clusters and the degree to which each park resembles those clusters’ medoids. In order to make this work, I had to set the membership exponent to 1.15; this is lower than most recommendations I could find, but anything larger gave all parks equal likelihoods of belonging to each cluster. The medoids of the six — and remember, that means the most representative point of each group — were (Old) Yankee Stadium 2008, Great American Ballpark 2013, Busch Stadium 2013, Marlins Park 2014, Target Field 2013, and Chase Field 2010. Using Tableau, I've created a radar plot-style visualization of each park's park factors split by handedness, so that the reader can get a quick visual impression of each park.
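With the cluster package's fanny() and the distance matrix assumed from earlier, the call might look like this; memb.exp is the membership exponent discussed above:

```r
library(cluster)

# Fuzzy clustering on the Mahalanobis distances; memb.exp = 1.15
# keeps the memberships from collapsing toward all-equal (larger
# values did exactly that on this data, per the text above)
fz <- fanny(as.dist(D), k = 6, diss = TRUE, memb.exp = 1.15)

head(fz$membership)   # each row: one park-season's six cluster likelihoods
fz$silinfo$avg.width  # overall silhouette width for the fuzzy result
```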

Generally speaking, two played as overall pitcher's parks (Busch and Marlins), two were about neutral (Yankee and Target), and two were overall hitter's parks (GABP and Chase).

Old Yankee Stadium 2008 was a neutral park overall, or very slightly on the hitter’s side if you average all the park factors included here, but actually suppressed LWTS runs for right-handed batters (for lefties it was elevated). All other characteristics played essentially neutral; the only things more than a single point off exactly neutral were strikeout percent for right-handed batters, which was elevated, and walk percent for RHBs, which was suppressed.

Great American Ballpark 2013 saw inflated LWTS runs, completely from balls in the air — LWTS runs on grounders were neutral. Both sides of the plate had elevated power, both from home runs and from ISO. GABP 2013 favored balls in the air over grounders, but only slightly, and played with a neutral BABIP and elevated strikeouts and walks. There weren't big qualitative differences between righties and lefties.

Busch Stadium 2013 was the most severe pitcher’s park overall; for righties, only strikeout percent was above average; power, in particular, was suppressed — its HR/PA factor was only 89. Things were better for lefties, but only in terms of magnitude; again, with one exception (this time walk rate), all factors showed suppression.

Marlins Park 2014 was also a pitcher’s park overall, but unlike Busch 2013 mostly hovered near average across all but one factor. Suppression of home runs played by far the largest role in disadvantaging batters, especially left-handed ones (whose park factor for HR/PA was only 85, the lowest across all parks and categories). Right-handers also saw suppression of their walk rates; lefties were unaffected.

Target Field 2013 was a neutral park overall; runs by LWTS were neutral for righties and low for lefties, but for both were actually above-average on ground balls, which were also elevated in terms of frequency. Power was low on both sides but more so for right-handers. Walks were low for both sides; LHBs faced a neutral BABIP and elevated K rates while RHBs saw elevated BABIPs and low K rates.

Chase Field 2010 saw hitters being favored by nearly all categories included, with the sole exception of the walk rates of left-handed batters. Notably, and unlike the five other parks, the effect on power was stronger on ISO than on HR/PA.  It also saw the highest LWTS runs on grounders of any park (both sides), and saw depressed strikeout rates on both sides.

The data for those six parks is retrievable through the embedded Tableau graphic, but I’d also like to include a table here of how strongly each park in the sample belonged to the clusters represented by the parks above. This is a long table — 240 rows of data — in which each ballpark and year (back to 2007) has its own line, the values of which sum to 100 across the six cluster columns. The last column is a rudimentary confidence metric: I take the maximum of the six cluster values and the difference between that maximum and the second-largest value, divide each by 100, and average them (a code sketch of this computation follows the table). A number near 1 in that column indicates a lot of surety in that park’s assignment to a cluster, and a number near zero is the opposite. You can see immediately that some parks seem to be harder to fit than others; I’d mention specifics, but I’m rapidly approaching 3,000 words on a failed project, so I’ll leave that to you to investigate on your own. Here’s the table:

Park Yankee GABP Busch Marlins Target Chase Confidence
Angel Stadium 2007 13 5 15 48 15 4 0.405
Angel Stadium 2008 0 12 85 0 1 1 0.790
Angel Stadium 2009 2 24 37 1 17 21 0.249
Angel Stadium 2010 6 5 43 16 13 17 0.345
Angel Stadium 2011 1 13 84 0 1 1 0.775
Angel Stadium 2012 3 25 64 0 5 2 0.517
Angel Stadium 2013 7 8 77 0 7 1 0.726
Angel Stadium 2014 1 5 84 0 8 2 0.800
Arlington 2007 3 52 1 0 33 11 0.356
Arlington 2008 60 10 0 0 29 1 0.455
Arlington 2009 0 95 4 0 1 0 0.928
Arlington 2010 0 96 2 0 1 1 0.953
Arlington 2011 0 93 1 0 1 4 0.913
Arlington 2012 0 88 4 0 3 5 0.859
Arlington 2013 0 91 7 0 0 2 0.873
Arlington 2014 0 82 9 0 2 6 0.777
AT&T 2007 0 1 63 0 34 1 0.461
AT&T 2008 0 1 79 0 20 0 0.691
AT&T 2009 0 2 72 0 25 0 0.592
AT&T 2010 0 2 11 0 81 6 0.753
AT&T 2011 0 0 0 0 98 1 0.975
AT&T 2012 1 2 3 0 92 2 0.907
AT&T 2013 11 5 2 0 77 4 0.720
AT&T 2014 2 35 25 0 26 13 0.218
Busch Stadium 2007 1 1 70 0 26 2 0.569
Busch Stadium 2008 1 2 48 2 34 14 0.309
Busch Stadium 2009 0 1 51 1 42 6 0.299
Busch Stadium 2010 0 0 91 0 9 0 0.863
Busch Stadium 2011 0 0 100 0 0 0 0.999
Busch Stadium 2012 0 0 100 0 0 0 0.995
Busch Stadium 2013 0 0 100 0 0 0 1.000
Busch Stadium 2014 0 0 100 0 0 0 1.000
Camden 2007 0 4 81 0 1 14 0.739
Camden 2008 33 7 4 0 17 38 0.218
Camden 2009 75 7 1 0 6 11 0.693
Camden 2010 76 20 0 0 2 2 0.661
Camden 2011 4 65 4 0 1 25 0.523
Camden 2012 3 30 1 0 9 57 0.416
Camden 2013 3 62 12 0 4 20 0.522
Camden 2014 0 75 14 0 1 10 0.686
Chase 2007 0 31 0 0 0 68 0.529
Chase 2008 0 1 0 0 0 99 0.986
Chase 2009 0 1 0 0 0 98 0.975
Chase 2010 0 0 0 0 0 100 1.000
Chase 2011 0 0 0 0 53 47 0.293
Chase 2012 1 1 0 0 74 25 0.609
Chase 2013 0 0 0 0 0 100 0.995
Chase 2014 0 3 1 0 0 95 0.941
Citi Field 2009 13 3 0 0 73 10 0.661
Citi Field 2010 0 0 2 0 52 46 0.290
Citi Field 2011 0 0 38 0 3 59 0.395
Citi Field 2012 0 1 71 0 2 25 0.589
Citi Field 2013 7 39 36 0 5 13 0.208
Citi Field 2014 2 72 25 0 0 1 0.600
Citizen's Bank 2007 2 95 1 0 2 0 0.942
Citizen's Bank 2008 5 58 20 0 11 5 0.483
Citizen's Bank 2009 99 1 0 0 0 0 0.988
Citizen's Bank 2010 96 3 0 0 1 0 0.942
Citizen's Bank 2011 87 11 0 0 1 0 0.815
Citizen's Bank 2012 100 0 0 0 0 0 0.996
Citizen's Bank 2013 91 9 0 0 0 0 0.869
Citizen's Bank 2014 99 1 0 0 0 0 0.984
Comerica 2007 0 26 8 0 13 52 0.391
Comerica 2008 3 36 7 0 16 37 0.190
Comerica 2009 0 22 62 0 8 8 0.512
Comerica 2010 3 11 6 0 35 45 0.281
Comerica 2011 10 12 9 0 35 33 0.188
Comerica 2012 6 4 51 0 23 16 0.390
Comerica 2013 1 3 50 0 13 33 0.338
Comerica 2014 1 1 23 0 10 65 0.533
Coors 2007 21 7 1 0 60 11 0.497
Coors 2008 50 4 1 0 39 7 0.309
Coors 2009 4 4 2 0 42 48 0.267
Coors 2010 11 31 6 0 35 18 0.193
Coors 2011 10 2 2 0 72 13 0.662
Coors 2012 4 14 2 0 33 47 0.301
Coors 2013 1 4 1 0 46 48 0.251
Coors 2014 3 2 0 0 52 43 0.303
Dodger Stadium 2007 99 0 0 0 0 0 0.989
Dodger Stadium 2008 100 0 0 0 0 0 0.997
Dodger Stadium 2009 38 0 1 0 61 0 0.419
Dodger Stadium 2010 61 3 10 0 24 2 0.493
Dodger Stadium 2011 46 3 6 0 44 1 0.241
Dodger Stadium 2012 0 2 85 0 11 2 0.793
Dodger Stadium 2013 1 2 65 0 9 23 0.539
Dodger Stadium 2014 6 40 46 0 5 3 0.261
Fenway 2007 17 53 5 0 2 23 0.419
Fenway 2008 6 71 8 0 1 14 0.644
Fenway 2009 8 59 22 0 1 9 0.477
Fenway 2010 1 26 60 0 1 13 0.465
Fenway 2011 0 3 2 0 4 90 0.880
Fenway 2012 4 10 11 0 8 67 0.615
Fenway 2013 0 5 18 0 17 59 0.504
Fenway 2014 2 13 4 0 34 48 0.305
GABP 2007 0 97 2 0 0 1 0.963
GABP 2008 1 98 1 0 0 1 0.971
GABP 2009 0 97 0 0 0 3 0.952
GABP 2010 0 100 0 0 0 0 0.994
GABP 2011 0 100 0 0 0 0 1.000
GABP 2012 0 100 0 0 0 0 1.000
GABP 2013 0 100 0 0 0 0 1.000
GABP 2014 0 100 0 0 0 0 1.000
Kauffmann 2007 1 10 59 0 7 22 0.481
Kauffmann 2008 1 12 61 0 25 2 0.484
Kauffmann 2009 1 5 19 8 47 20 0.368
Kauffmann 2010 2 5 3 0 83 7 0.797
Kauffmann 2011 1 1 5 1 50 42 0.291
Kauffmann 2012 0 3 17 0 38 41 0.218
Kauffmann 2013 0 3 3 0 2 91 0.897
Kauffmann 2014 1 51 11 0 6 31 0.348
Marlins Park 2012 8 39 35 6 9 4 0.213
Marlins Park 2013 0 0 0 100 0 0 1.000
Marlins Park 2014 0 0 0 100 0 0 1.000
Metrodome 2007 2 2 4 0 89 3 0.873
Metrodome 2008 10 2 5 0 79 4 0.743
Metrodome 2009 13 6 7 0 72 2 0.659
Miller 2007 0 99 0 0 0 0 0.991
Miller 2008 0 97 2 0 0 0 0.961
Miller 2009 0 14 86 0 0 0 0.789
Miller 2010 0 87 10 0 0 3 0.824
Miller 2011 1 66 19 0 1 13 0.568
Miller 2012 0 39 59 0 0 2 0.390
Miller 2013 0 83 3 0 1 13 0.766
Miller 2014 0 98 1 0 0 1 0.969
Minute Maid 2007 0 1 1 0 33 65 0.483
Minute Maid 2008 1 33 5 0 52 9 0.352
Minute Maid 2009 0 50 5 0 39 7 0.303
Minute Maid 2010 0 88 9 0 3 0 0.831
Minute Maid 2011 0 76 22 0 2 1 0.645
Minute Maid 2012 0 30 67 0 1 1 0.524
Minute Maid 2013 0 69 24 0 1 5 0.567
Minute Maid 2014 1 65 10 0 10 15 0.572
Nationals Park 2007 8 3 48 2 10 29 0.334
Nationals Park 2008 0 2 38 0 1 59 0.404
Nationals Park 2009 0 2 7 0 1 91 0.873
Nationals Park 2010 0 75 12 0 1 12 0.695
Nationals Park 2011 0 52 19 0 1 28 0.375
Nationals Park 2012 1 41 1 0 2 54 0.337
Nationals Park 2013 6 38 10 0 22 25 0.252
Nationals Park 2014 3 3 2 0 75 17 0.659
O.Co 2007 0 0 100 0 0 0 0.994
O.Co 2008 0 1 98 0 0 1 0.979
O.Co 2009 0 0 100 0 0 0 0.999
O.Co 2010 0 0 98 0 2 0 0.968
O.Co 2011 0 0 96 0 3 0 0.948
O.Co 2012 0 0 100 0 0 0 0.998
O.Co 2013 0 0 100 0 0 0 0.997
O.Co 2014 3 3 66 0 10 18 0.564
Petco 2007 15 22 56 1 5 1 0.453
Petco 2008 4 27 58 0 10 1 0.446
Petco 2009 3 14 81 0 1 0 0.743
Petco 2010 2 48 49 0 1 0 0.250
Petco 2011 1 10 82 0 5 1 0.771
Petco 2012 4 7 5 0 77 8 0.731
Petco 2013 1 12 82 0 4 1 0.764
Petco 2014 4 23 22 0 35 17 0.237
PNC Park 2007 0 0 100 0 0 0 0.999
PNC Park 2008 0 0 98 0 0 1 0.976
PNC Park 2009 0 0 99 0 0 0 0.992
PNC Park 2010 0 0 100 0 0 0 0.997
PNC Park 2011 0 0 100 0 0 0 0.997
PNC Park 2012 0 0 96 0 1 2 0.955
PNC Park 2013 0 0 1 0 96 2 0.955
PNC Park 2014 2 0 3 0 94 1 0.920
Progressive 2007 51 2 1 0 42 5 0.297
Progressive 2008 3 1 2 0 92 2 0.902
Progressive 2009 1 5 8 0 82 5 0.782
Progressive 2010 2 7 8 0 73 10 0.686
Progressive 2011 0 0 0 0 97 2 0.958
Progressive 2012 2 1 3 0 89 6 0.861
Progressive 2013 26 1 1 0 44 28 0.303
Progressive 2014 3 1 0 0 3 93 0.915
Rogers 2007 4 10 6 0 77 3 0.718
Rogers 2008 0 17 9 0 62 12 0.534
Rogers 2009 0 60 29 0 7 4 0.453
Rogers 2010 0 58 23 0 8 11 0.469
Rogers 2011 0 82 8 0 4 6 0.786
Rogers 2012 0 2 0 0 0 97 0.962
Rogers 2013 0 41 1 0 2 56 0.358
Rogers 2014 0 4 0 0 2 94 0.918
Safeco 2007 3 17 55 0 18 7 0.460
Safeco 2008 22 28 13 0 34 2 0.196
Safeco 2009 16 9 25 0 43 7 0.304
Safeco 2010 3 17 77 0 4 0 0.682
Safeco 2011 10 27 55 0 7 1 0.418
Safeco 2012 45 3 50 0 2 0 0.271
Safeco 2013 9 2 88 0 0 0 0.838
Safeco 2014 13 2 86 0 0 0 0.792
Shea Stadium 2007 1 1 8 0 12 77 0.710
Shea Stadium 2008 0 1 5 0 1 93 0.911
Sun Life 2007 2 63 20 0 9 7 0.529
Sun Life 2008 3 66 12 1 7 12 0.598
Sun Life 2009 2 34 15 1 14 33 0.178
Sun Life 2010 4 22 2 4 34 35 0.180
Sun Life 2011 5 13 1 1 51 29 0.361
Target 2010 1 1 16 0 78 4 0.707
Target 2011 1 0 0 0 99 0 0.984
Target 2012 0 0 0 0 100 0 1.000
Target 2013 0 0 0 0 100 0 1.000
Target 2014 0 0 0 0 100 0 1.000
Tropicana 2007 0 79 1 0 1 19 0.698
Tropicana 2008 0 94 2 0 0 3 0.927
Tropicana 2009 0 97 1 0 0 1 0.965
Tropicana 2010 0 99 1 0 0 0 0.988
Tropicana 2011 7 75 13 0 0 5 0.690
Tropicana 2012 98 1 0 0 1 0 0.977
Tropicana 2013 17 1 2 0 79 1 0.706
Tropicana 2014 60 5 11 0 24 0 0.478
Turner 2007 9 25 62 0 3 1 0.493
Turner 2008 99 1 0 0 0 0 0.988
Turner 2009 96 2 1 0 1 0 0.953
Turner 2010 100 0 0 0 0 0 0.995
Turner 2011 99 0 1 0 0 0 0.984
Turner 2012 100 0 0 0 0 0 0.996
Turner 2013 98 0 1 0 1 0 0.978
Turner 2014 42 5 40 0 11 3 0.218
US Cellular 2007 0 36 3 0 3 57 0.389
US Cellular 2008 0 9 12 0 1 78 0.722
US Cellular 2009 0 6 0 0 0 93 0.905
US Cellular 2010 0 3 0 0 0 97 0.960
US Cellular 2011 0 80 2 0 0 17 0.719
US Cellular 2012 0 96 2 0 0 2 0.947
US Cellular 2013 0 96 3 0 0 0 0.950
US Cellular 2014 3 86 10 0 1 0 0.809
Wrigley 2007 1 54 0 0 10 35 0.367
Wrigley 2008 1 49 0 0 23 26 0.361
Wrigley 2009 0 98 0 0 1 0 0.979
Wrigley 2010 0 92 3 0 4 1 0.905
Wrigley 2011 0 65 25 0 8 2 0.528
Wrigley 2012 0 54 25 0 18 2 0.420
Wrigley 2013 0 61 35 0 3 1 0.429
Wrigley 2014 0 22 51 0 5 22 0.402
(Old) Yankee Stadium 2007 100 0 0 0 0 0 1.000
(Old) Yankee Stadium 2008 100 0 0 0 0 0 1.000
Yankee Stadium 2009 57 38 2 0 2 1 0.385
Yankee Stadium 2010 68 27 0 0 1 4 0.542
Yankee Stadium 2011 11 88 0 0 0 1 0.830
Yankee Stadium 2012 2 97 0 0 0 0 0.961
Yankee Stadium 2013 2 98 0 0 0 0 0.964
Yankee Stadium 2014 79 21 0 0 0 0 0.684
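As promised, here's a sketch of the confidence column, computed from the fuzzy membership matrix (rescaled to percentages to match the table):

```r
# Memberships as percentages, matching the table above
memb_pct <- fz$membership * 100

# Confidence: average of (max value / 100) and (gap to runner-up / 100);
# e.g., Angel Stadium 2007's 48 and 15 give (0.48 + 0.33) / 2 = 0.405
confidence <- apply(memb_pct, 1, function(m) {
  s <- sort(m, decreasing = TRUE)
  (s[1] / 100 + (s[1] - s[2]) / 100) / 2
})

round(head(confidence), 3)
```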
You can also use the same silhouette measure to assess the appropriateness of the fuzzy cluster result; the widths are the weighted average of the silhouette widths you’d find if each point belonged to each of the six clusters. A sketch of that computation and the result follow — note that the image is a little different from the earlier ones, since it had to be created manually in Excel rather than automatically in R.
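This helper is my own reconstruction of the description above (the exact Excel computation isn't shown): each point's width in each cluster is computed as if it belonged there, with all other points held at their crisp max-membership assignments, and the six widths are then weighted by the point's memberships.

```r
# Fuzzy silhouette widths, per the description in the text
fuzzy_silhouette <- function(D, memb) {
  D <- as.matrix(D)
  n <- nrow(memb); k <- ncol(memb)
  hard <- apply(memb, 1, which.max)        # crisp assignments for the rest
  sapply(seq_len(n), function(i) {
    w <- sapply(seq_len(k), function(j) {
      own <- setdiff(which(hard == j), i)  # cluster j, minus point i itself
      a <- if (length(own)) mean(D[i, own]) else 0
      b <- min(sapply(setdiff(seq_len(k), j),
                      function(m) mean(D[i, hard == m])), na.rm = TRUE)
      (b - a) / max(a, b)                  # standard silhouette formula
    })
    sum(memb[i, ] * w)                     # weight widths by membership
  })
}

widths <- fuzzy_silhouette(D, fz$membership)
```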

[Figure: silhouette plot for the fuzzy clustering]

One last point on the poor results seen here. You might be wondering if my utter failure to definitively cluster ballparks based on their park factors is an artifact of the factors themselves, which I calculated myself, rather than of something about the parks. In order to address this, I tried to repeat the above work (at least in part) on two other park factor data sets. First, I reduced the number of years being averaged to produce the park factors from five to three. This did not result in any significant difference in silhouette width for any of the clustering methods. Second, I replaced my park factors with the factors FanGraphs provides on their Guts! page, still split by handedness, and again found no meaningful differences. This leads me to conclude that it's because of the parks themselves, and not because of an artifact in the data, that clustering parks beyond hitter's and pitcher's parks does not yield any definitive information.

. . .

Much of the information used here was obtained free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at www.retrosheet.org. Some other information courtesy of FanGraphs.

John Choiniere is a researcher and featured (occasional) writer at Beyond the Box Score. You can follow him on Twitter at @johnchoiniere.