
lineup simulator in C

This is probably only of interest to a very few select people here, but I've finished the first version of my lineup simulator that I mentioned in this diary. The code is here. See this post for some suggestions and requirements for compiling.

Right now it seems to give consistently low numbers, on the order of 1 to 1.2 runs lower per game than the run estimator developed by Ken Arneson that was adapted over at Baseball Musings. I'm not sure why it's so low, so if anyone wants to look over my code and check whether I'm doing something really stupid, that would be nice. It's reasonably well commented so it shouldn't be too hard to understand, but feel free to ask questions. (I'm testing it on the core lineup of the 2005 Astros. I'm using aggregated stats for the pitching staff as one player.)


As far as performance goes, on a G4-based Mac Mini or PowerBook, it takes a little under 2.5 days to run the simulator for all 9! (362,880) lineups. For any one particular lineup, the mean runs per game is calculated every 1000 games and inserted into a 100-element array (both numbers are arbitrary). When the standard deviation of the numbers in the 100-element array falls below 0.002 (i.e., the mean has stabilized), it moves on to the next lineup. This generally happens somewhere between 150,000 and 250,000 games per lineup, translating to something like 72 billion games total. I've tweaked the code to run as fast as possible; the only way to get a major increase in speed at this point would be to parallelize the code.
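A minimal sketch of the stopping rule described above, not the actual code from the simulator. It assumes the value stored every 1000 games is the cumulative mean so far (the post doesn't say whether it is cumulative or per-batch), and the `Convergence` struct and the `conv_*` names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define WINDOW    100     /* checkpoint means kept (arbitrary, per the post) */
#define BATCH     1000    /* games played between checkpoints (arbitrary, per the post) */
#define TOLERANCE 0.002   /* standard deviation below which the mean counts as stable */

/* Hypothetical bookkeeping struct; not taken from the original source. */
typedef struct {
    double window[WINDOW]; /* most recent cumulative means, used as a ring buffer */
    size_t filled;         /* how many slots of window hold real data */
    size_t next;           /* slot that the next checkpoint overwrites */
    long   games;          /* games simulated so far for this lineup */
    long   total_runs;     /* runs scored so far for this lineup */
} Convergence;

static void conv_init(Convergence *c)
{
    memset(c, 0, sizeof *c);
}

/* Population variance of the n values in v. */
static double variance(const double *v, size_t n)
{
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    double mean = sum / (double)n;
    double ss = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = v[i] - mean;
        ss += d * d;
    }
    return ss / (double)n;
}

/* Record the runs scored over one batch of BATCH games.
 * Returns 1 once the window is full and its standard deviation
 * is below TOLERANCE, otherwise 0. */
static int conv_add_batch(Convergence *c, long runs_in_batch)
{
    c->total_runs += runs_in_batch;
    c->games += BATCH;
    c->window[c->next] = (double)c->total_runs / (double)c->games;
    c->next = (c->next + 1) % WINDOW;
    if (c->filled < WINDOW)
        c->filled++;
    if (c->filled < WINDOW)
        return 0;
    /* Compare the variance with TOLERANCE squared so no sqrt call is needed. */
    return variance(c->window, WINDOW) < TOLERANCE * TOLERANCE;
}
```

In this sketch the driver would call `conv_add_batch` once per 1000 simulated games and move on to the next lineup when it returns 1. Because the window has to fill first, no lineup can finish in fewer than 100,000 games.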

Update [2006-3-23 14:1:50 by false cognate]: I've thought of two reasons why the simulated total might be so much lower than the Astros' actual run total last year. The Astros were near the top of the league in steals, which my simulator doesn't account for, and they also platooned heavily: Lamb and Palmeiro both had significant at bats (322 and 204, vs. 318 for Burke, who is in my lineup), and both have better pop than Burke. However, this doesn't account for the differential with the run estimator at Baseball Musings.


Comments


Hi FC.
I'm going to be out of town for a while, so I'm going to look at this in early April.  But, it sounds like you did an interesting job and I look forward to fiddling with a new toy.

by salb918 on Mar 23, 2006 2:17 PM EST

Seems like you could fake SBs
relatively easily.

When I made an abortive attempt to write a game simulator a while back, I did something like this:

  1. give every player a "steal rating" -- basically, if they have the opportunity to steal, what are the odds they'll try?   --I don't remember how I calc'd that, but it seems pretty easy, and most players can just be set at 0.
  2. before processing each at-bat, check to see if there's a steal situation -- basically, just "if runner on first, no other runners" is good enough for now.
  3. then possibly attempt steal.
step 2 could be adapted almost wholesale to find possible GIDP situations...but then again you'd need numbers on GIDP/(possible GIDPs)...that's retrosheet-able, but I haven't done it and I don't know if anybody has.

I don't write or read C, sadly.  (I write in Python, my bestest friend.)  I can follow the code well enough to see you haven't made any serious logical errors, though.

Hey! I might have just figured out a big part of the discrepancy: errors!  In the 2005 NL, the difference between R/G and ER/G is .25.  Not the whole thing, but a big chunk right there.  I would imagine a good stealing team might give you another .05 or .1 ...sac flies as well.  

Also, you could be a bit more liberal with extra bases on singles and doubles.  I like your compromise for the sake of speed and simplicity, but if you want to go for upped accuracy, you could keep my slapdash speed estimator and let that determine the likelihood somebody goes for an extra base.  Maybe that'd be good for another .05.

To try to reconcile with the actual total, you might try to create some sort of Lamb/Duke monster using their L/R splits or something that would reflect how well that lineup spot does on the daily basis.

Damn, this stuff is fun :).

---
http://www.BrewCrewBall.com

Daily Brewers Blog: BrewCrewBall.com

by jeffbcb @ Beyond the Box Score on Mar 23, 2006 4:14 PM EST

Player-dependent situations
would require a pretty huge rewrite. Right now the program just keeps track of whether a base is occupied or not but doesn't track which player is on which base. I was thinking about this problem for a while and I'm not sure how to handle it; I think it would probably slow things down quite a bit but I agree that your general idea is probably the most straightforward method and would lead to reasonable results.

As was pointed out to me before, Baseball Prospectus does in fact track DP opportunities for players, but I'm still new enough to the whole sabermetrics thing that I haven't convinced myself to shell out the money for a subscription.

Rolling Lamb and Burke's stats together might not be such a bad idea; it was some pretty weird platooning last year with Berkman/Lamb at 1st and Burke/Berkman out in LF with Palmeiro subbing all over the outfield. Defensive errors are something I hadn't even begun to think about, but yes, that does help account for about 25% of the discrepancy with respect to actual runs scored. I think steals and speed are probably a pretty significant portion as well.

If I can get the code parallelized, I might try rewriting the code to track which player is on which base and thus start including steals, GIDP, and sacrifices.
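One way the "which player is on which base" bookkeeping could look, as a rough sketch only. It assumes each base stores the lineup slot (0-8) of the runner or a sentinel for empty, and it uses the simplest possible advancement rule (every runner moves exactly as many bases as the batter). The `Bases` struct and function names are hypothetical, not from the simulator.

```c
#include <assert.h>

#define EMPTY (-1)   /* sentinel: nobody on this base */

/* Hypothetical replacement for a plain occupied/unoccupied flag per base. */
typedef struct {
    int on_base[3];  /* lineup slot (0-8) of the runner on 1st, 2nd, 3rd, or EMPTY */
} Bases;

static void bases_clear(Bases *b)
{
    for (int i = 0; i < 3; i++)
        b->on_base[i] = EMPTY;
}

/* Advance every runner and the batter by n bases
 * (1 = single, 2 = double, 3 = triple, 4 = home run).
 * Returns the number of runs that score on the play. */
static int bases_advance(Bases *b, int batter, int n)
{
    assert(n >= 1 && n <= 4);
    int runs = 0;
    /* Move the lead runner first so nobody lands on an occupied base. */
    for (int from = 2; from >= 0; from--) {
        int runner = b->on_base[from];
        if (runner == EMPTY)
            continue;
        b->on_base[from] = EMPTY;
        int to = from + n;
        if (to >= 3)
            runs++;                    /* runner crosses the plate */
        else
            b->on_base[to] = runner;
    }
    if (n == 4)
        runs++;                        /* batter scores on a home run */
    else
        b->on_base[n - 1] = batter;
    return runs;
}
```

Storing a small integer per base instead of a flag costs very little, and it gives a steal or GIDP routine a way to look up the stats of whoever is on first.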

by false cognate on Mar 23, 2006 4:56 PM EST

ahh, ok.
gotcha.  I know full well how difficult it is to track who's on base when from my attempts to parse natural-language pbp logs.  

I suppose you could take a half-step and work in the possibility of steals in a non-player-specific way, but that wouldn't shed much light on optimal lineup deployment.  If Taveras leads off b/c he can steal, it wouldn't make much sense to basically award everybody on the team 11 SBs :).

Another thing you might consider to deal with the platoon thing would be, instead of using player names, use aggregate lineup deployment from the entire season.  I'll bet you could get that data from Pinto's day-by-day database: instead of using Taveras as a possible person to plug into the lineup, use the stats that all Astros #1 hitters gathered when they batted leadoff.  Same idea as you've done with pitchers.

Speaking of which: there's another .2 rpg (total WAG) or so: late-inning pinch hitters.  You've got a 9th place hitter in the late innings performing like Clemens and Pettitte when it should be Palmeiro.  The aggregate approach would go a long way to solving that problem.  An even better solution would be to split 9th (and maybe 8th?) place hitters by AB-in-game -- that is, have one set of stats for the 9th place hitter's first AB (usually pitcher), and so on.  A half-way measure would be to split the game in halves...one of those two is probably necessary because if PHs are getting, say, 2 ABs/g, the 9th place hitter would look very good overall when they really shouldn't.

I hope that all makes sense.  

And I realize I'm throwing out a bunch of ideas that would all slow your program down a lot :).

Daily Brewers Blog: BrewCrewBall.com

by jeffbcb @ Beyond the Box Score on Mar 23, 2006 5:35 PM EST

A faster way
Instead of trying to get all kinds of complex tracking going on in memory, why not use a MySQL database? You can load the players into it, and while you're using them they can have a field that indicates what base they're on. That way all the tracking is done in a database and not in some huge memory construct.

by cephyn on Mar 23, 2006 11:01 PM EST

that's going to be
hella slow.

No matter how fast your database is, building a program around querying a database is going to be several orders of magnitude slower than just putting everything in memory and writing a specialized program in C. Memory is cheap - hell both machines I've been running this program on have at least a gig of RAM. And every last bit of speed counts if you're going to do all 9! permutations...
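For a sense of how little state the 9! enumeration needs when everything stays in memory, here is a sketch that steps through batting orders in lexicographic order using the standard next-permutation technique. It is not how the simulator actually generates lineups (the post doesn't say), and the function name is made up.

```c
#include <assert.h>

/* Rearrange a[0..n-1] into the next lexicographic permutation in place.
 * Returns 1 on success, or 0 if a was already the last permutation
 * (in which case a is left unchanged). */
static int next_permutation(int *a, int n)
{
    assert(n >= 1);
    /* Find the rightmost position whose value is smaller than its successor. */
    int i = n - 2;
    while (i >= 0 && a[i] >= a[i + 1])
        i--;
    if (i < 0)
        return 0;

    /* Find the rightmost value larger than a[i] and swap the two. */
    int j = n - 1;
    while (a[j] <= a[i])
        j--;
    int tmp = a[i];
    a[i] = a[j];
    a[j] = tmp;

    /* Reverse the tail so it is in ascending order again. */
    for (int lo = i + 1, hi = n - 1; lo < hi; lo++, hi--) {
        tmp = a[lo];
        a[lo] = a[hi];
        a[hi] = tmp;
    }
    return 1;
}
```

Starting from the order 0, 1, ..., 8 and calling this until it returns 0 visits every one of the 362,880 lineups exactly once, with nothing stored beyond the nine-element array itself.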

by false cognate on Mar 24, 2006 2:21 AM EST

well
the next step might be to upgrade to C++ so you can create player objects that hold their current base inside.

by cephyn on Mar 24, 2006 12:01 PM EST

it's not a bad plan
I already have a struct for players in C; moving to C++ wouldn't be too terribly hard. I guess I'd have to find out if the particular random number generator library is ported to C++, or rewrite it myself - it shaved probably 7 - 10 percent off of the total run time, so I'm loath to give it up...

by false cognate on Mar 24, 2006 12:25 PM EST

fun project
I like stuff like this, hehe. This would be fun to play around with.

I notice you still have an if/else in there, moving it to switch/case might improve run time.

Good work though! I have some computers doing nothing at home I could set to task if you wanted some computational help on anything. I also have a dual core laptop that's pretty fast. 8)

by cephyn on Mar 24, 2006 5:11 PM EST

Like I said earlier
let the players advance an additional base when there are two outs.  That will help you a lot.

by salb918 on Mar 24, 2006 1:02 AM EST

almost sure he already put that in
n/t
Daily Brewers Blog: BrewCrewBall.com

by jeffbcb @ Beyond the Box Score on Mar 24, 2006 1:36 AM EST

I did add this
and it only really upped the runs per game by a couple of tenths. Pretty good - 20 to 40 runs per season - but not nearly enough. Something else is going on, and I don't know what...

by false cognate on Mar 24, 2006 2:22 AM EST

steal opps?
Does anyone track steal opportunities? Probably what I mean is occasions when a player is on first with second open, and no outs.

I think if someone actually tracks this stat, I could write a steal subroutine into my simulator with minimal problems...

by false cognate on Mar 25, 2006 2:33 AM EST

I don't think anybody tracks it.
two things: why only with no outs?

also, like I think I wrote earlier up there somewhere, this would help you get the run totals closer to correct, but it would probably take away from the value of the lineup simulator, given that part of the value of a proper lineup may be optimizing placement of high-steal guys.  

I might be able to extract a rough number for you from '04 retrosheet logs...not sure what I have in what form right now.

Daily Brewers Blog: BrewCrewBall.com

by jeffbcb @ Beyond the Box Score on Mar 25, 2006 7:53 AM EST

No outs
because of this article over at Baseball Prospectus. It seems to be the best situation for steals, although I suppose I could code in the probabilities based on the entire run expectation table. Basically I was thinking of this particular algorithm:

Before determining what the result of a particular at bat is, the algorithm would check to see if there was a man on first with second open (and third as well, if you believe the BP article). If so, it would then generate a random number and compare it against a calculated attempted steal probability [(SB + CS) / Steal opps]. If the steal attempt is made, it would then create another random number and compare against steal percentage [SB / (SB + CS)].
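A sketch of that two-roll check, under a few assumptions that are not in the post: base occupancy is a bitmask (bit 0 = first, bit 1 = second, bit 2 = third), the two random numbers are uniform draws in [0, 1) supplied by the caller so any generator can be plugged in, and the struct, enum, and function names are invented for illustration.

```c
#include <assert.h>

/* Hypothetical per-player steal numbers; opps would have to come from
 * whoever tracks steal opportunities. */
typedef struct {
    int sb;    /* stolen bases */
    int cs;    /* caught stealing */
    int opps;  /* steal opportunities */
} StealStats;

typedef enum { STEAL_NONE, STEAL_SAFE, STEAL_CAUGHT } StealResult;

/* Steal situation: runner on first, second and third both open.
 * Assumes bit 0 = first, bit 1 = second, bit 2 = third. */
static int steal_situation(unsigned bases)
{
    return bases == 0x1u;
}

/* Decide a steal using two uniform draws in [0, 1):
 *   u_attempt is compared against (SB + CS) / opps,
 *   u_success is compared against SB / (SB + CS). */
static StealResult try_steal(const StealStats *s, double u_attempt, double u_success)
{
    assert(s->sb >= 0 && s->cs >= 0);
    int attempts = s->sb + s->cs;
    if (attempts == 0 || s->opps <= 0)
        return STEAL_NONE;            /* player never runs, or no data */

    double p_attempt = (double)attempts / (double)s->opps;
    if (u_attempt >= p_attempt)
        return STEAL_NONE;            /* runner stays put */

    double p_success = (double)s->sb / (double)attempts;
    return (u_success < p_success) ? STEAL_SAFE : STEAL_CAUGHT;
}
```

The simulator would call `steal_situation` before each at bat (adding the no-outs condition if desired) and only then spend the two random draws, so players with zero attempts cost almost nothing.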

I think that it would actually improve the lineup simulator, because while the leadoff man is probably the most likely to be on first with no outs and second and third open, if you count steal opps only when there are no outs, it should help renormalize. For example, while Biggio had only 11 steals (and 1 CS), he probably had far fewer steal opps than Taveras since he was generally second in the lineup last year. That is, successful steal percentage is probably a good measure of speed, but total steal attempts is probably a poor measure of how often to attempt a steal, because when you should attempt a steal is inherently lineup dependent.

by false cognate on Mar 25, 2006 12:56 PM EST

gotcha
I see your point about improving the simulator--iff you're using player-specific information, which I thought you weren't doing yet, if at all.

As for the no outs part, I guess you have to decide whether you're trying to optimize lineup construction in an "ideal" world where Keith Woolner is managing, or you want to optimize lineup construction for real-life managers.  If the first, then yeah, the run-expectation table is the way to go.  If the second, you'd probably want the rate at which steals are attempted for each base/out situation.  (At least for each number of outs with runner on first, second open.)

Daily Brewers Blog: BrewCrewBall.com

by jeffbcb @ Beyond the Box Score on Mar 25, 2006 2:39 PM EST

player-specific information
isn't in there yet, but I think I figured out a way to do it without incurring a ton of overhead. It might even let me work in GIDP, although I'd probably need to subscribe to BP to get the relevant statistics.

by false cognate on Mar 25, 2006 4:13 PM EST
