Beyond the Box Score: An SB Nation Community

lineup simulator in C

This is probably only of interest to a very few select people here, but I've finished the first version of my lineup simulator that I mentioned in this diary. The code is here. See this post for some suggestions and requirements for compiling.

Right now it seems to give consistently low numbers, on the order of 1 to 1.2 runs lower per game than the run estimator developed by Ken Arneson that was adapted over at Baseball Musings. I'm not sure why it's so low, so if anyone wants to look over my code and see whether I'm doing something really stupid, that would be nice. It's reasonably well commented so it shouldn't be too hard to understand, but feel free to ask questions. (I'm testing it on the core lineup of the 2005 Astros. I'm using aggregated stats for the pitching staff as one player.)

As far as performance goes, on a G4-based Mac Mini or PowerBook, it takes a little under 2.5 days to run the simulator for all 9! lineups. For any one particular lineup, the mean runs per game is calculated every 1000 games and inserted into a 100-element array (both numbers are arbitrary). When the standard deviation of the numbers in the 100-element array falls below 0.002 (i.e., the mean has stabilized), it moves on to the next lineup. This generally happens somewhere between 150,000 and 250,000 games, translating to something like 72 billion games total (9! = 362,880 lineups at roughly 200,000 games each). I've tweaked the code to run as fast as possible; the only way to get a major increase in speed at this point would be to parallelize the code.
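The stopping rule described above can be sketched in C roughly as follows. This is a minimal sketch, not the simulator's actual code: it assumes the value stored at each checkpoint is the cumulative running mean, that the 100-element array is reused circularly, and that a population standard deviation is meant. The names `ConvWindow`, `conv_push`, `conv_is_stable`, `simulate_until_stable` and the `play_game` callback are all hypothetical.

```c
#include <stddef.h>

#define WINDOW      100    /* checkpoints kept (arbitrary, per the post) */
#define BLOCK_GAMES 1000   /* games between checkpoints (arbitrary, per the post) */
#define SD_LIMIT    0.002  /* stop once the spread of the window drops below this */

typedef struct {
    double means[WINDOW];  /* most recent running means, overwritten circularly */
    size_t count;          /* checkpoints recorded so far */
} ConvWindow;

/* Record the running mean at a checkpoint, overwriting the oldest entry. */
void conv_push(ConvWindow *w, double running_mean)
{
    w->means[w->count % WINDOW] = running_mean;
    w->count++;
}

/* Return 1 once the window is full and its standard deviation is below SD_LIMIT. */
int conv_is_stable(const ConvWindow *w)
{
    double sum = 0.0, sq = 0.0, avg;
    size_t i;

    if (w->count < WINDOW)
        return 0;
    for (i = 0; i < WINDOW; i++)
        sum += w->means[i];
    avg = sum / WINDOW;
    for (i = 0; i < WINDOW; i++)
        sq += (w->means[i] - avg) * (w->means[i] - avg);
    /* Compare variances so that no sqrt() call is needed. */
    return sq / WINDOW < SD_LIMIT * SD_LIMIT;
}

/* Simulate one lineup until the running mean is stable. play_game is a
   hypothetical callback returning the runs scored in one simulated game. */
double simulate_until_stable(int (*play_game)(void), long *games_out)
{
    ConvWindow w = { {0.0}, 0 };
    double total = 0.0;
    long games = 0;
    int i;

    do {
        for (i = 0; i < BLOCK_GAMES; i++)
            total += play_game();
        games += BLOCK_GAMES;
        conv_push(&w, total / (double)games);
    } while (!conv_is_stable(&w));

    *games_out = games;
    return total / (double)games;
}
```

Under these assumptions a lineup can never finish in fewer than 100,000 games, since the window has to fill before the check can pass.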

Update [2006-3-23 14:1:50 by false cognate]: Two reasons I've thought of that might explain why it's so much lower than the actual run total for the Astros last year: the Astros were near the top of the league in steals, which aren't accounted for in my simulator, and they also did some significant platooning. Lamb and Palmeiro both had significant at bats (322 and 204, vs. 318 for Burke, who is in my lineup) and both have better pop than Burke. However, this doesn't account for the differential with the run estimator at Baseball Musings.

Comments

Hi FC.
I'm going to be out of town for a while, so I'm going to look at this in early April.  But, it sounds like you did an interesting job and I look forward to fiddling with a new toy.

by salb918 on Mar 23, 2006 2:17 PM EST

Seems like you could fake SBs
relatively easily.

When I made an abortive attempt to write a game simulator a while back, I did something like this:

  1. give every player a "steal rating" -- basically, if they have the opportunity to steal, what are the odds they'll try?   --I don't remember how I calc'd that, but it seems pretty easy, and most players can just be set at 0.
  2. before processing each at-bat, check to see if there's a steal situation -- basically, just "if runner on first, no other runners" is good enough for now.
  3. then possibly attempt steal.

Step 2 could be adapted almost wholesale to find possible GIDP situations...but then again you'd need numbers on GIDP/(possible GIDPs)...that's retrosheet-able, but I haven't done it and I don't know if anybody has.

I don't write or read C, sadly.  (I write in Python, my bestest friend.)  I can follow the code well enough to see you haven't made any serious logical errors, though.

Hey! I might have just figured out a big part of the discrepancy: errors!  In the 2005 NL, the difference between R/G and ER/G is .25.  Not the whole thing, but a big chunk right there.  I would imagine a good stealing team might give you another .05 or .1 ...sac flies as well.  

Also, you could be a bit more liberal with extra bases on singles and doubles.  I like your compromise for the sake of speed and simplicity, but if you want to go for upped accuracy, you could keep my slapdash speed estimator and let that determine the likelihood somebody goes for an extra base.  Maybe that'd be good for another .05.

To try to reconcile with the actual total, you might try to create some sort of Lamb/Duke monster using their L/R splits or something that would reflect how well that lineup spot does on a daily basis.

Damn, this stuff is fun :).

---
http://www.BrewCrewBall.com

Daily Brewers Blog: BrewCrewBall.com

by jeffbcb on Mar 23, 2006 4:14 PM EST

Player-dependent situations
would require a pretty huge rewrite. Right now the program just keeps track of whether a base is occupied or not but doesn't track which player is on which base. I was thinking about this problem for a while and I'm not sure how to handle it; I think it would probably slow things down quite a bit but I agree that your general idea is probably the most straightforward method and would lead to reasonable results.

As was pointed out to me before, Baseball Prospectus does in fact track DP opportunities for players, but I'm still new enough to the whole sabermetrics thing that I haven't convinced myself to shell out the money for a subscription.

Rolling Lamb and Burke's stats together might not be such a bad idea; it was some pretty weird platooning last year with Berkman/Lamb at 1st and Burke/Berkman out in LF with Palmeiro subbing all over the outfield. Defensive errors are something I hadn't even begun to think about, but yes, that does help account for about 25% of the discrepancy with respect to actual runs scored. I think steals and speed are probably a pretty significant portion as well.

If I can get the code parallelized, I might try rewriting the code to track which player is on which base and thus start including steals, GIDP, and sacrifices.
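One way the tracking discussed above might be done cheaply, as a sketch only: keep the lineup slot of each runner in a three-element array instead of an occupied/empty flag. Nothing here is taken from the simulator's code; `Bases`, `bases_clear`, and `bases_advance` are hypothetical names, and the rule that every runner advances exactly as many bases as the batter is a simplifying assumption.

```c
#define EMPTY (-1)

typedef struct {
    int runner[3];  /* lineup slot (0-8) of the runner on 1st, 2nd, 3rd, or EMPTY */
} Bases;

/* Empty all three bases at the start of a half-inning. */
void bases_clear(Bases *b)
{
    int i;
    for (i = 0; i < 3; i++)
        b->runner[i] = EMPTY;
}

/* Advance every runner and the batter by n bases (1 = single ... 4 = home run).
   Returns the number of runs scored on the play. */
int bases_advance(Bases *b, int batter, int n)
{
    int runs = 0;
    int i;

    /* Work from third base back to first so no runner is overwritten. */
    for (i = 2; i >= 0; i--) {
        int dest = i + n;
        if (b->runner[i] == EMPTY)
            continue;
        if (dest >= 3)
            runs++;                          /* runner crosses the plate */
        else
            b->runner[dest] = b->runner[i];  /* runner moves up */
        b->runner[i] = EMPTY;
    }

    if (n >= 4)
        runs++;                              /* batter scores on a home run */
    else
        b->runner[n - 1] = batter;
    return runs;
}
```

Because the state is still three small integers, the per-play cost stays close to that of the occupancy-only version, while a steal or GIDP routine can look up who is on first by reading `runner[0]`.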

by false cognate on Mar 23, 2006 4:56 PM EST

ahh, ok.
gotcha.  I know full well how difficult it is to track who's on base when from my attempts to parse natural-language pbp logs.  

I suppose you could take a half-step and work in the possibility of steals in a non-player-specific way, but that wouldn't shed much light on optimal lineup deployment.  If Taveras leads off b/c he can steal, it wouldn't make much sense to basically award everybody on the team 11 SBs :).

Another thing you might consider to deal with the platoon thing would be, instead of using player names, use aggregate lineup deployment from the entire season.  I'll bet you could get that data from Pinto's day-by-day database: instead of using Taveras as a possible person to plug into the lineup, use the stats that all Astros #1 hitters gathered when they batted leadoff.  Same idea as you've done with pitchers.

Speaking of which: there's another .2 rpg (total WAG) or so: late-inning pinch hitters.  You've got a 9th place hitter in the late innings performing like Clemens and Pettitte when it should be Palmeiro.  The aggregate approach would go a long way to solving that problem.  An even better solution would be to split 9th (and maybe 8th?) place hitters by AB-in-game -- that is, have one set of stats for the 9th place hitter's first AB (usually pitcher), and so on.  A half-way measure would be to split the game in halves...one of those two is probably necessary because if PHs are getting, say, 2 ABs/g, the 9th place hitter would look very good overall when they really shouldn't.
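The AB-in-game split could look something like the sketch below. It is only an illustration of the idea: the `BatLine` fields, the `LineupSlot` layout, and `slot_line` are hypothetical, and the per-plate-appearance probabilities would have to come from aggregate data that the post does not provide.

```c
#define MAX_SPLITS 4

/* Outcome probabilities per plate appearance; whatever remains is an out. */
typedef struct {
    double p_single, p_double, p_triple, p_hr, p_walk;
} BatLine;

/* One lineup slot with a separate stat line for each trip to the plate. */
typedef struct {
    BatLine by_pa[MAX_SPLITS];  /* by_pa[k] = aggregate line for the slot's (k+1)th PA */
    int n_splits;               /* populated entries (1 means no split at all) */
} LineupSlot;

/* Pick the stat line for a slot's pa_index-th plate appearance (0-based).
   Appearances past the last split reuse the final line. */
const BatLine *slot_line(const LineupSlot *s, int pa_index)
{
    int k = pa_index < s->n_splits ? pa_index : s->n_splits - 1;
    return &s->by_pa[k];
}
```

With `n_splits` set to 2, the first trip to the plate would use the pitcher-heavy line and every later one the pinch-hitter-heavy line, which is close to the half-way measure of splitting the game in halves.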

I hope that all makes sense.

And I realize I'm throwing out a bunch of ideas that would all slow your program down a lot :).

by jeffbcb on Mar 23, 2006 5:35 PM EST

A faster way
Instead of trying to get all kinds of complex tracking going on in memory, why not use a MySQL database? You can load the players into it, and while you're using them they can have a field that indicates what base they're on. That way all the tracking is done in a database and not in some huge memory construct.

by cephyn on Mar 23, 2006 11:01 PM EST

that's going to be
hella slow.

No matter how fast your database is, building a program around querying a database is going to be several orders of magnitude slower than just putting everything in memory and writing a specialized program in C. Memory is cheap - hell both machines I've been running this program on have at least a gig of RAM. And every last bit of speed counts if you're going to do all 9! permutations...

by false cognate on Mar 24, 2006 2:21 AM EST

well
the next step might be to upgrade to C++ so you can create player objects that hold their current base inside.

by cephyn on Mar 24, 2006 12:01 PM EST

it's not a bad plan
I already have a struct for players in C; moving to C++ wouldn't be too terribly hard. I guess I'd have to find out if the particular random number generator library is ported to C++, or rewrite it myself - it shaved probably 7 - 10 percent off of the total run time, so I'm loath to give it up...

by false cognate on Mar 24, 2006 12:25 PM EST

fun project
I like stuff like this hehe. this would be fun to play around with.

I notice you still have an if/else in there; moving it to switch/case might improve run time.

Good work though! I have some computers doing nothing at home I could set to task if you wanted some computational help on anything. I also have a dual core laptop that's pretty fast. 8)

by cephyn on Mar 24, 2006 5:11 PM EST

Like I said earlier
let the players advance an additional base when there are two outs.  That will help you a lot.
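A rough sketch of that rule, using an occupancy-only representation like the one the simulator is described as using. The bit layout and the function name `advance_on_hit` are assumptions made for illustration, as is the baseline rule that runners otherwise advance exactly as far as the batter.

```c
enum { FIRST = 1, SECOND = 2, THIRD = 4 };  /* occupancy bits */

/* Count the set bits in a small value. */
static int count_bits(unsigned v)
{
    int n = 0;
    while (v) {
        n += (int)(v & 1u);
        v >>= 1;
    }
    return n;
}

/* Apply a hit of hit_bases (1 = single ... 4 = home run) to the base state.
   Runners move hit_bases bases, plus one extra base when there are two outs.
   Returns the number of runs scored and updates *bases in place. */
int advance_on_hit(unsigned *bases, int hit_bases, int outs)
{
    int runner_shift = hit_bases + ((outs == 2 && hit_bases < 4) ? 1 : 0);
    unsigned moved = *bases << runner_shift;
    int runs = count_bits(moved >> 3);   /* bits shifted past third have scored */

    moved &= 7u;                         /* keep only first, second, third */
    if (hit_bases == 4)
        runs++;                          /* batter scores on a home run */
    else
        moved |= 1u << (hit_bases - 1);  /* batter takes his base */
    *bases = moved;
    return runs;
}
```

For example, with a runner on second and two outs, a single scores the run under this rule, whereas with fewer than two outs the runner stops at third.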

by salb918 on Mar 24, 2006 1:02 AM EST

I did add this
and it only really upped the runs per game by a couple of tenths. Pretty good - 20 to 40 runs per season - but not nearly enough. Something else is going on, and I don't know what...

by false cognate on Mar 24, 2006 2:22 AM EST

steal opps?
Does anyone track steal opportunities? Probably what I mean is occasions when a player is on first with second open, and no outs.

I think if someone actually tracks this stat, I could write a steal subroutine into my simulator with minimal problems...

by false cognate on Mar 25, 2006 2:33 AM EST

I don't think anybody tracks it.
two things: why only with no outs?

also, like I think I wrote earlier up there somewhere, this would help you get the run totals closer to correct, but it would probably take away from the value of the lineup simulator, given that part of the value of a proper lineup may be optimizing placement of high-steal guys.  

I might be able to extract a rough number for you from '04 retrosheet logs...not sure what I have in what form right now.

by jeffbcb on Mar 25, 2006 7:53 AM EST

No outs
because of this article over at Baseball Prospectus. It seems to be the best situation for steals, although I suppose I could code in the probabilities based on the entire run expectation table. Basically I was thinking of this particular algorithm:

Before determining what the result of a particular at bat is, the algorithm would check to see if there was a man on first with second open (and third as well, if you believe the BP article). If so, it would then generate a random number and compare it against a calculated attempted steal probability [(SB + CS) / Steal opps]. If the steal attempt is made, it would then create another random number and compare against steal percentage [SB / (SB + CS)].
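In C, the check just described might look roughly like this. It is a sketch under assumptions: `steal_opps` is a stat nobody is known to track, the bit layout and all the names (`StealStats`, `resolve_steal`, `try_steal`, `uniform01`) are made up for illustration, and `rand()` stands in for whatever random number library the simulator really uses.

```c
#include <stdlib.h>

enum { FIRST = 1, SECOND = 2, THIRD = 4 };  /* occupancy bits */

typedef struct {
    int sb;          /* successful steals */
    int cs;          /* times caught stealing */
    int steal_opps;  /* steal opportunities (hypothetical stat) */
} StealStats;

typedef enum { STEAL_NONE, STEAL_SAFE, STEAL_OUT } StealResult;

/* Uniform random number in [0, 1). */
double uniform01(void)
{
    return rand() / (RAND_MAX + 1.0);
}

/* Decide a steal from two uniform draws in [0, 1).
   Eligible only with no outs, a runner on first, and second and third open. */
StealResult resolve_steal(const StealStats *s, int bases, int outs,
                          double u_attempt, double u_success)
{
    int attempts = s->sb + s->cs;

    if (bases != FIRST || outs != 0)
        return STEAL_NONE;
    if (attempts == 0 || s->steal_opps <= 0)
        return STEAL_NONE;
    /* Attempt probability: (SB + CS) / steal opps. */
    if (u_attempt >= (double)attempts / s->steal_opps)
        return STEAL_NONE;
    /* Success probability: SB / (SB + CS). */
    return u_success < (double)s->sb / attempts ? STEAL_SAFE : STEAL_OUT;
}

/* Convenience wrapper that draws both random numbers up front. The post's
   version would draw the second number only once an attempt is made. */
StealResult try_steal(const StealStats *s, int bases, int outs)
{
    double u_attempt = uniform01();
    double u_success = uniform01();
    return resolve_steal(s, bases, outs, u_attempt, u_success);
}
```

The caller would run this before each at bat, moving the runner to second on `STEAL_SAFE`, or clearing first base and adding an out on `STEAL_OUT`. Keeping the random draws as parameters of `resolve_steal` makes the decision logic easy to test without touching the generator.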

I think that it would actually improve the lineup simulator. The leadoff man is probably the most likely to be on first with no outs and second and third open, but if you count steal opps only when there are no outs, it should help renormalize: while Biggio had only 11 steals (and 1 CS), he probably had far fewer steal opps than Taveras, since he was generally second in the lineup last year. That is, successful steal percentage is probably a good measure of speed, but total steal attempts is probably a poor measure of how often to attempt a steal, because when you should attempt one is inherently lineup-dependent.

by false cognate on Mar 25, 2006 12:56 PM EST

gotcha
I see your point about improving the simulator--iff you're using player-specific information, which I thought you weren't doing yet, if at all.

As for the no outs part, I guess you have to decide whether you're trying to optimize lineup construction in an "ideal" world where Keith Woolner is managing, or you want to optimize lineup construction for real-life managers.  If the first, then yeah, the run-expectation table is the way to go.  If the second, you'd probably want the rate at which steals are attempted for each base/out situation.  (At least for each number of outs with runner on first, second open.)

by jeffbcb on Mar 25, 2006 2:39 PM EST

player-specific information
isn't in there yet, but I think I figured out a way to do it without incurring a ton of overhead. It might even let me work in GIDP, although I'd probably need to subscribe to BP to get the relevant statistics.

by false cognate on Mar 25, 2006 4:13 PM EST
