lineup simulator in C
This is probably only of interest to a very few select people here, but I've finished the first version of my lineup simulator that I mentioned in this diary. The code is here. See this post for some suggestions and requirements for compiling.
Right now it seems to give consistently low numbers, on the order of 1 to 1.2 runs lower per game than the run estimator developed by Ken Arneson that was adapted over at Baseball Musings. I'm not sure why it's so low, so if anyone wants to look over my code and check whether I'm doing something really stupid, that would be nice. It's reasonably well commented so it shouldn't be too hard to understand, but feel free to ask questions. (I'm testing it on the core lineup of the 2005 Astros. I'm using aggregated stats for the pitching staff as one player.)
As far as performance goes, on a G4-based Mac Mini or PowerBook, it takes a little under 2.5 days to run the simulator for all 9! lineups. For any one particular lineup, the mean runs per game is calculated every 1000 games and inserted into a 100-element array (both numbers are arbitrary). When the standard deviation of the numbers in the 100-element array falls below 0.002 (i.e., the mean has stabilized), it moves on to the next lineup. This generally happens somewhere between 150,000 and 250,000 games, translating to something like 72 billion games total. I've tweaked the code to run as fast as possible; the only way to get a major increase in speed at this point would be to parallelize the code.
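The stopping rule described above can be sketched roughly as follows. This is only an illustration, not the actual simulator code: `simulate_lineup`, `play_game`, and `stub_game` are hypothetical names, and treating the 100-element array as a circular buffer is an assumption, since the post does not say how old entries are replaced. The sketch compares the variance against 0.002 squared, which is equivalent to comparing the standard deviation against 0.002 and avoids a square root.

```c
#include <stddef.h>

#define WINDOW 100       /* stored running means (number from the post) */
#define BATCH 1000       /* games played between samples (from the post) */
#define TOLERANCE 0.002  /* threshold on the standard deviation (from the post) */

/* Population variance of n values. */
static double variance(const double *x, size_t n)
{
    double sum = 0.0, sq = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    double mean = sum / (double)n;
    for (size_t i = 0; i < n; i++)
        sq += (x[i] - mean) * (x[i] - mean);
    return sq / (double)n;
}

/* Play games for one lineup until the running mean stabilizes.
 * play_game is a hypothetical callback that returns the runs scored
 * in one simulated game; ctx is passed through to it untouched.
 * Returns the mean runs per game and stores the game count in *games_out. */
static double simulate_lineup(int (*play_game)(void *), void *ctx,
                              long *games_out)
{
    double window[WINDOW];
    size_t filled = 0, next = 0;
    long games = 0;
    double total_runs = 0.0;

    for (;;) {
        for (int i = 0; i < BATCH; i++)
            total_runs += play_game(ctx);
        games += BATCH;

        /* Record the running mean; overwrite the oldest entry once full
         * (assumption: circular buffer). */
        window[next] = total_runs / (double)games;
        next = (next + 1) % WINDOW;
        if (filled < WINDOW)
            filled++;

        /* std dev < TOLERANCE  <=>  variance < TOLERANCE^2 */
        if (filled == WINDOW &&
            variance(window, WINDOW) < TOLERANCE * TOLERANCE)
            break;
    }
    if (games_out)
        *games_out = games;
    return total_runs / (double)games;
}

/* Placeholder game for trying the loop out: every game scores 4 runs. */
static int stub_game(void *ctx)
{
    (void)ctx;
    return 4;
}
```

With the placeholder game the window fills with identical means, so the loop stops as soon as the array is full, after 100 batches of 1000 games. A real game function would take longer, in line with the 150,000 to 250,000 games mentioned above.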
Update [2006-3-23 14:1:50 by false cognate]: Two reasons I've thought of that might be why it's so much lower than the actual run total for the Astros last year - the Astros were near the top of the league in steals, which aren't accounted for in my simulator, and they also had some significant platooning - Lamb and Palmeiro both had significant at bats (322 and 204 vs. Burke's 318 who is in my lineup) and both have better pop than Burke. However, this doesn't account for the differential with the run estimator at Baseball Musings.
Comments
Hi FC.
by salb918 on Mar 23, 2006 2:17 PM EST reply actions 0 recs
Seems like you could fake SBs
When I made an abortive attempt to write a game simulator a while back, I did something like this:
- give every player a "steal rating" -- basically, if they have the opportunity to steal, what are the odds they'll try? --I don't remember how I calc'd that, but it seems pretty easy, and most players can just be set at 0.
- before processing each at-bat, check to see if there's a steal situation -- basically, just "if runner on first, no other runners" is good enough for now.
- then possibly attempt steal.
I don't write or read C, sadly. (I write in Python, my bestest friend.) I can follow the code well enough to see you haven't made any serious logical errors, though.
Hey! I might have just figured out a big part of the discrepancy: errors! In the 2005 NL, the difference between R/G and ER/G is .25. Not the whole thing, but a big chunk right there. I would imagine a good stealing team might give you another .05 or .1 ...sac flies as well.
Also, you could be a bit more liberal with extra bases on singles and doubles. I like your compromise for the sake of speed and simplicity, but if you want to go for upped accuracy, you could keep my slapdash speed estimator and let that determine the likelihood somebody goes for an extra base. Maybe that'd be good for another .05.
To try to reconcile with the actual total, you might try to create some sort of Lamb/Duke monster using their L/R splits or something that would reflect how well that lineup spot does on the daily basis.
Damn, this stuff is fun :).
---
http://www.BrewCrewBall.com
by jeffbcb on Mar 23, 2006 4:14 PM EST reply actions 0 recs
Player-dependent situations
As was pointed out to me before, Baseball Prospectus does in fact track DP opportunities for players, but I'm still new enough to the whole sabermetrics thing that I haven't convinced myself to shell out the money for a subscription.
Rolling Lamb and Burke's stats together might not be such a bad idea; it was some pretty weird platooning last year with Berkman/Lamb at 1st and Burke/Berkman out in LF with Palmeiro subbing all over the outfield. Defensive errors are something I hadn't even begun to think about, but yes, that does help account for about 25% of the discrepancy with respect to actual runs scored. I think steals and speed are probably a pretty significant portion as well.
If I can get the code parallelized, I might try rewriting the code to track which player is on which base and thus start including steals, GIDP, and sacrifices.
by false cognate on Mar 23, 2006 4:56 PM EST up reply actions 0 recs
ahh, ok.
I suppose you could take a half-step and work in the possibility of steals in a non-player-specific way, but that wouldn't shed much light on optimal lineup deployment. If Taveras leads off b/c he can steal, it wouldn't make much sense to basically award everybody on the team 11 SBs :).
Another thing you might consider to deal with the platoon thing would be, instead of using player names, use aggregate lineup deployment from the entire season. I'll bet you could get that data from Pinto's day-by-day database: instead of using Taveras as a possible person to plug into the lineup, use the stats that all Astros #1 hitters gathered when they batted leadoff. Same idea as you've done with pitchers.
Speaking of which: there's another .2 rpg (total WAG) or so: late-inning pinch hitters. You've got a 9th place hitter in the late innings performing like Clemens and Pettitte when it should be Palmeiro. The aggregate approach would go a long way to solving that problem. An even better solution would be to split 9th (and maybe 8th?) place hitters by AB-in-game -- that is, have one set of stats for the 9th place hitter's first AB (usually pitcher), and so on. A half-way measure would be to split the game in halves...one of those two is probably necessary because if PHs are getting, say, 2 ABs/g, the 9th place hitter would look very good overall when they really shouldn't.
I hope that all makes sense.
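A rough sketch of the "split by AB-in-game" idea, assuming the half-way version with just two stat lines per lineup slot. Every name here (`BatLine`, `LineupSlot`, `line_for_trip`) and the particular set of outcome probabilities is made up for illustration; the actual simulator may store batter stats quite differently.

```c
/* Hypothetical per-plate-appearance outcome probabilities for one hitter
 * (or one aggregate of hitters). */
typedef struct {
    double p_out, p_walk, p_single, p_double, p_triple, p_homer;
} BatLine;

/* A lineup slot with two stat lines: one for the early trips to the plate
 * (usually the starting pitcher in the 9th slot) and one for later trips
 * (often pinch hitters). */
typedef struct {
    BatLine by_trip[2];
    int split_after;  /* number of trips that use by_trip[0] */
} LineupSlot;

/* Pick the stat line for this slot's trip-th plate appearance of the game
 * (trip is 0-based). */
static const BatLine *line_for_trip(const LineupSlot *slot, int trip)
{
    return (trip < slot->split_after) ? &slot->by_trip[0]
                                      : &slot->by_trip[1];
}
```

The simulator would need to count trips per slot within a game, which is one extra integer per slot, so the slowdown should be small compared with tracking individual pinch hitters.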
And I realize I'm throwing out a bunch of ideas that would all slow your program down a lot :).
by jeffbcb on Mar 23, 2006 5:35 PM EST up reply actions 0 recs
A faster way
by cephyn on Mar 23, 2006 11:01 PM EST up reply actions 0 recs
that's going to be
No matter how fast your database is, building a program around querying a database is going to be several orders of magnitude slower than just putting everything in memory and writing a specialized program in C. Memory is cheap - hell, both machines I've been running this program on have at least a gig of RAM. And every last bit of speed counts if you're going to do all 9! permutations...
by false cognate on Mar 24, 2006 2:21 AM EST up reply actions 0 recs
well
by cephyn on Mar 24, 2006 12:01 PM EST up reply actions 0 recs
it's not a bad plan
by false cognate on Mar 24, 2006 12:25 PM EST up reply actions 0 recs
fun project
I notice you still have an if/else in there, moving it to switch/case might improve run time.
Good work though! I have some computers doing nothing at home I could set to task if you wanted some computational help on anything. I also have a dual core laptop that's pretty fast. 8)
by cephyn on Mar 24, 2006 5:11 PM EST up reply actions 0 recs
Like I said earlier
by salb918 on Mar 24, 2006 1:02 AM EST up reply actions 0 recs
almost sure he already put that in
by jeffbcb on Mar 24, 2006 1:36 AM EST up reply actions 0 recs
I did add this
by false cognate on Mar 24, 2006 2:22 AM EST up reply actions 0 recs
steal opps?
I think if someone actually tracks this stat, I could write a steal subroutine into my simulator with minimal problems...
by false cognate on Mar 25, 2006 2:33 AM EST reply actions 0 recs
I don't think anybody tracks it.
also, like I think I wrote earlier up there somewhere, this would help you get the run totals closer to correct, but it would probably take away from the value of the lineup simulator, given that part of the value of a proper lineup may be optimizing placement of high-steal guys.
I might be able to extract a rough number for you from '04 retrosheet logs...not sure what I have in what form right now.
by jeffbcb on Mar 25, 2006 7:53 AM EST up reply actions 0 recs
No outs
Before determining what the result of a particular at bat is, the algorithm would check to see if there was a man on first with second open (and third as well, if you believe the BP article). If so, it would then generate a random number and compare it against a calculated attempted steal probability [(SB + CS) / Steal opps]. If the steal attempt is made, it would then create another random number and compare against steal percentage [SB / (SB + CS)].
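The check described above might look something like this in C. It is a sketch under assumptions, not code from the simulator: the struct, the function names, and the choice to pass the two random numbers in as arguments are all hypothetical, and it takes the stricter reading that both second and third must be open. The steal-opportunity count is exactly the stat nobody seems to track, so it would have to come from somewhere else.

```c
#include <stdlib.h>

/* Hypothetical per-runner steal stats. */
typedef struct {
    int sb;          /* successful steals */
    int cs;          /* times caught stealing */
    int steal_opps;  /* steal opportunities (assumed to be available) */
} RunnerStats;

typedef enum { NO_ATTEMPT, STOLEN_BASE, CAUGHT_STEALING } StealResult;

/* Attempt probability: (SB + CS) / steal opps. */
static double attempt_prob(const RunnerStats *r)
{
    if (r->steal_opps <= 0)
        return 0.0;
    return (double)(r->sb + r->cs) / (double)r->steal_opps;
}

/* Success probability: SB / (SB + CS). */
static double success_prob(const RunnerStats *r)
{
    int attempts = r->sb + r->cs;
    if (attempts <= 0)
        return 0.0;
    return (double)r->sb / (double)attempts;
}

/* Uniform random number in [0, 1) from the standard rand(). */
static double uniform01(void)
{
    return rand() / (RAND_MAX + 1.0);
}

/* Resolve a possible steal before the at bat is simulated.
 * u_attempt and u_success are uniform random numbers in [0, 1),
 * e.g. from uniform01(); passing them in keeps the logic testable. */
static StealResult resolve_steal(const RunnerStats *r,
                                 int on_first, int on_second, int on_third,
                                 double u_attempt, double u_success)
{
    /* Steal situation: runner on first, second and third open. */
    if (!on_first || on_second || on_third)
        return NO_ATTEMPT;
    if (u_attempt >= attempt_prob(r))
        return NO_ATTEMPT;
    return (u_success < success_prob(r)) ? STOLEN_BASE : CAUGHT_STEALING;
}
```

In the game loop this would be called as `resolve_steal(&runner, on1, on2, on3, uniform01(), uniform01())` ahead of each plate appearance, with the caller moving the runner to second or recording an out depending on the result.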
I think that it would actually improve the lineup simulator. The leadoff man is probably the most likely to be on first with no outs and second and third open, but if you count steal opps only when there are no outs, it should help renormalize. That is, while Biggio had only 11 steals (and 1 CS), he probably had far fewer steal opps than Taveras, since he was generally second in the lineup last year. Successful steal percentage is probably a good measure of speed; total steal attempts is probably a poor measure of how often to attempt a steal, because when you should attempt one is inherently lineup dependent.
by false cognate on Mar 25, 2006 12:56 PM EST up reply actions 0 recs
gotcha
As for the no outs part, I guess you have to decide whether you're trying to optimize lineup construction in an "ideal" world where Keith Woolner is managing, or you want to optimize lineup construction for real-life managers. If the first, then yeah, the run-expectation table is the way to go. If the second, you'd probably want the rate at which steals are attempted for each base/out situation. (At least for each number of outs with runner on first, second open.)
by jeffbcb on Mar 25, 2006 2:39 PM EST up reply actions 0 recs
player-specific information
by false cognate on Mar 25, 2006 4:13 PM EST up reply actions 0 recs
