Learning R: Calculating the odds of .400

Showing my math from an earlier post.

Texas Rangers v Colorado Rockies Photo by Dustin Bradford/Getty Images

Earlier this week, while everyone was wondering whether Charlie Blackmon could hit .400, I wondered if Donovan Solano could do the same thing.

I maintain that Charlie Blackmon hitting .400 would be fun but predictable and therefore also kind of boring. Blackmon hitting .400 would be a lot like Star Wars: The Force Awakens. Episode VII was a fun movie, but J.J. Abrams wasn’t interested in telling any new Star Wars stories, so you could see every plot point coming a parsec away. Donovan Solano hitting .400 would be more like Dark City: bizarre, obscure, and something you’ll still think about 22 years later. I don’t think I ever need to see Force Awakens again, but I had to stop writing this sentence so I could see if Dark City was on any streaming platforms. (It’s on Vudu, whatever that is.)

My estimate was that if Solano were a true-talent .300 hitter, he had roughly a 0.7 percent chance of accomplishing the feat. That was before Solano went 3-for-8 during the work week, which not only brought down his average but also reminded us that Solano has to fight for playing time with rising star(ter) Mauricio Dubón.

Whether Solano (or Blackmon, I guess) can do it isn’t important for this particular post. Instead, I’m going to focus on how I calculated the odds using R.

Last week, I simulated an expanded postseason which saw the 29-31 Milwaukee Brewers win the World Series. To do that, I followed code provided in Analyzing Baseball Data with R, which uses the rmultinom() function to simulate a set number of games between two teams. This function essentially performs a series of coin flips using given probabilities. In the case of simulating a game, it’s the probability that one team will win versus the other.

rmultinom() needs three arguments in order to function: n, size, and prob. n is the number of times the simulation is run, size is the number of “coin flips” performed in each simulation, and prob is the probability of each outcome, supplied as a numeric vector that is usually either pulled from a data frame or saved as a value in the R global environment.

Simulating a seven-game series looks like this:

rmultinom(1, 7, prob)
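The line above assumes prob already exists. As a minimal sketch, here it is with a made-up pair of win probabilities (the 55/45 split and the seed are assumptions for illustration, not numbers from the postseason simulation):

```r
# Hypothetical win probabilities: Team A wins 55% of games, Team B wins 45%.
prob <- c(0.55, 0.45)

set.seed(2020)  # arbitrary seed so the draw can be reproduced
series <- rmultinom(1, 7, prob)

# rmultinom() returns a matrix: one row per outcome, one column per simulation.
dim(series)  # 2 rows, 1 column
sum(series)  # always 7, since every game goes to one team or the other
```

Note that this plays all seven games every time. A real series stops once a team gets to four wins, but whoever has four or more in the first row or second row is the winner either way.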

I figured this could also be used to determine whether or not a player could get a hit. In every at bat, a player can either get a hit or make an out (or reach on an error). The probability of getting a hit is helpfully approximated by batting average.

To simulate a typical, four at-bat game for a .300 batter, you could first define a value for their batting average and then use the rmultinom() function.

average <- c(.300, .700)

rmultinom(1, 4, average)

This returns two values. The first value is how many hits the batter gets, and the second is how many outs they make.

To simulate how many hits a .300 batter would get over the course of the season, simply change the size to how many at bats they are expected to get. To change the number of seasons, change n.
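Here’s a rough sketch of that change. The 500 at bats and five seasons are placeholder values I picked for illustration, as is the seed:

```r
average <- c(.300, .700)

at_bats   <- 500  # hypothetical full-season workload
n_seasons <- 5    # hypothetical number of seasons to simulate

set.seed(300)  # arbitrary seed for reproducibility
seasons <- rmultinom(n_seasons, at_bats, average)

# Row 1 holds hits and row 2 holds outs; each column is one simulated season.
hits <- seasons[1, ]

# Simulated batting average for each season.
round(hits / at_bats, 3)
```

Even a true .300 hitter will bounce around from season to season, which is the whole reason there’s any chance of .400 at all.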

When seeing how likely it is that Donovan Solano (or Charlie Blackmon) hits .400 over the season, it’s necessary to first determine how likely it is that the batter gets a hit. For Blackmon, it’s a little easier since he has had more consistent production and playing time. Blackmon’s batting average between 2017 and 2019 was .312, so the value for him getting a hit could be set to:

blackmon <- c(.312, .688)

Solano is a little trickier because he hasn’t played in the majors that much recently and his production is all over the place. Using his career batting average of .277 seems too low and using his average over the last three years (.356!) is way too high. Instead, Solano becomes our generic .300 hitter.

solano <- c(.300, .700)

Then, we need to find how many more at bats each batter will get by the end of the season. To do this I divided 60 by the number of games played by their team and multiplied that value by how many at bats they had so far. That gives a good enough approximation of how many at bats they’ll have at season’s end if they continue to get the same amount of playing time. Of course, we need to subtract how many at bats they already have.

For Blackmon, I’m estimating 240 at bats total and 164 at bats remaining. For Solano, that’s 180 at bats total and 117 remaining. To hit .400, Blackmon needs 96 hits total, so he needs 62 more hits. Solano needs 72 hits total and 44 more hits.
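The arithmetic can be sketched in R as well. The helper name remaining_at_bats() is hypothetical, the rounding is my assumption, and the 50-at-bat example is a made-up batter. The totals, remaining at bats, and hits still needed are the figures quoted above, and the at bats and hits so far are simply backed out from them.

```r
# Scale at bats so far up to a 60-game season, then subtract what's already banked.
# (Hypothetical helper; rounding to a whole at bat is an assumption.)
remaining_at_bats <- function(ab_so_far, team_games, season_games = 60) {
  projected_total <- season_games / team_games * ab_so_far
  round(projected_total) - ab_so_far
}

# Made-up batter: 50 at bats through 20 team games.
remaining_at_bats(50, 20)

# Figures quoted in the article.
total_ab     <- c(blackmon = 240, solano = 180)
remaining_ab <- c(blackmon = 164, solano = 117)
more_hits    <- c(blackmon = 62,  solano = 44)

# .400 is 2/5, so the target is two hits for every five at bats, rounded up.
target_hits <- ceiling(total_ab * 2 / 5)

# Implied by the figures above rather than stated outright.
ab_so_far   <- total_ab - remaining_ab
hits_so_far <- target_hits - more_hits

# Implied batting average to date.
round(hits_so_far / ab_so_far, 3)
```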

Now we can simulate 1,000 seasons for each batter and save the results in the global environment. For Blackmon:

blackmon.sims <- rmultinom(1000, 164, blackmon)

For Solano:

solano.sims <- rmultinom(1000, 117, solano)

rmultinom() returns a matrix with one column per simulated season. It’s easier to work with this data if each season is a row in a data frame with named columns, and that’s done by entering the following:

solanodf <- data.frame(matrix(solano.sims, 1000, 2, byrow = TRUE, dimnames = list(c(1:1000), c("Hits", "Outs"))))

blackmondf <- data.frame(matrix(blackmon.sims, 1000, 2, byrow = TRUE, dimnames = list(c(1:1000), c("Hits", "Outs"))))
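If the matrix() call looks like a lot, what it’s really doing is flipping the simulation output on its side. A sketch of an alternative using t(), which should produce the same layout (the seed is arbitrary, and this is my shorthand rather than the approach from the book):

```r
solano <- c(.300, .700)

set.seed(42)  # arbitrary seed for reproducibility
solano.sims <- rmultinom(1000, 117, solano)

# Two rows (hits, outs) and 1,000 columns (one per simulated season).
dim(solano.sims)

# t() transposes the matrix so each season becomes a row.
solanodf <- data.frame(t(solano.sims))
names(solanodf) <- c("Hits", "Outs")

head(solanodf)

# Same numbers as the matrix(..., byrow = TRUE) version above.
via_matrix <- matrix(solano.sims, 1000, 2, byrow = TRUE)
all(via_matrix == t(solano.sims))
```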

Since we know how many hits Solano and Blackmon need, we can use the mutate() function from the dplyr package to flag whether they hit .400 in each simulation.

library(dplyr)

solanodf <- solanodf %>%

mutate(Yes = ifelse(Hits >= 44, 1, 0))

blackmondf <- blackmondf %>%

mutate(Yes = ifelse(Hits >= 62, 1, 0))

Now it’s easy to count up how many times each batter hit their threshold.

summarize(solanodf, sum(Yes))

summarize(blackmondf, sum(Yes))
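For anyone who wants to try this without the data frame steps, here’s a condensed base R sketch of the whole thing. The seed is arbitrary and the variable names are mine, so the counts won’t match my run exactly.

```r
set.seed(400)  # arbitrary seed; different seeds give slightly different counts

# 1,000 simulated rest-of-seasons for each batter.
solano.sims   <- rmultinom(1000, 117, c(.300, .700))
blackmon.sims <- rmultinom(1000, 164, c(.312, .688))

# Row 1 of each result is the number of hits in each simulated season.
solano_hits   <- solano.sims[1, ]
blackmon_hits <- blackmon.sims[1, ]

# Count the seasons that reach the hit totals needed for .400.
solano_yes   <- sum(solano_hits >= 44)
blackmon_yes <- sum(blackmon_hits >= 62)

# Share of simulated seasons at or above .400.
solano_yes / 1000
blackmon_yes / 1000
```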

Solano hit .400 in 38 of 1,000 tries, which now gives him a 3.8 percent chance. For Solano, sitting out two games between Monday and Friday actually increased his chances quite a bit. That is, if he’s actually (only?) a .300 hitter. Blackmon did it 54 times, which gives him a 5.4 percent chance. On Wednesday, I gave him a 7.9 percent chance (using his career batting average of .307), so going 0-for-8 between Wednesday and Friday didn’t help.
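One cross-check that wasn’t part of my original process: with only two outcomes, the hit count in each simulation follows a binomial distribution, so the probability can be computed directly instead of simulated. A sketch using pbinom(), assuming the same hit probabilities and thresholds as above:

```r
# P(at least 44 hits in 117 at bats) for a .300 hitter.
# pbinom(43, ..., lower.tail = FALSE) gives P(hits > 43), i.e. P(hits >= 44).
solano_exact <- pbinom(43, size = 117, prob = 0.300, lower.tail = FALSE)

# P(at least 62 hits in 164 at bats) for a .312 hitter.
blackmon_exact <- pbinom(61, size = 164, prob = 0.312, lower.tail = FALSE)

solano_exact
blackmon_exact
```

These should land in the same neighborhood as the simulated percentages. With only 1,000 simulated seasons, the simulation can easily miss the exact figure by a percentage point or so in either direction.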

Kenny Kelly is the managing editor of Beyond the Box Score. You can follow him on Twitter @KennyKellyWords.