Tango on Baseball Archives

By The Numbers - Sept 7 (September 8, 2003)

The new issue of SABR's By The Numbers is out, and it has a few interesting topics. It's a PDF file, so you are best off right-clicking the above link and choosing "save target as". The file is under 300 KB.

An alternative to the log5 / Odds Ratio method for computing the win% between 2 teams.
Forecasting batting average by using the "binary" approach that DIPS has made famous.
An article with the following tag line: Cubs manager Dusty Baker recently sparked controversy when he asserted that black players perform better in hot weather than white players do. Here, the author looks for evidence of whether the assertion is, or isn’t, supported by the statistical record.
The theoretical underpinnings of Pythag. I would love it if someone were to apply what is presented here against the Tango Distribution (which is a more accurate model of run scoring than the Poisson Distribution). I'd also like to change that damn name, if someone can suggest something.
Shane Holmes presents his Favorite Toy analysis, which we've seen in some form or other here/fanhome.

--posted by TangoTiger at 09:51 AM EDT

Posted 11:48 a.m., September 8, 2003 (#1) - reno dakota
So... does anyone have any ideas how to calculate for alpha in the log5 formula?

Posted 3:37 p.m., September 8, 2003 (#2) - bob mong
Good stuff - except for that pitch-count estimator. That's a load of crap :)

Posted 9:20 a.m., September 9, 2003 (#3) - Alan Jordan
What alpha are you referring to? The alpha from Skiena's formula for the probability of one jai alai player beating another, Cronbach's Alpha for internal consistency, or some other alpha?

Posted 10:09 a.m., September 9, 2003 (#4) - Andrew Edwards
Heh. I expected the skin-tone thing to pretty completely rebut Dusty.

With that sample size, it neither rebuts nor confirms Dusty's hypothesis. But if I was going to present evidence that suggested any sort of difference between "races", I'd be damn sure my methodology was airtight. Doing otherwise borders on the irresponsible.

Incidentally, if the effect seen there is real, I'd say it was much more likely attributable to players who were raised in hot climates performing better in hot conditions than to anything about genetic traits.

Posted 12:45 p.m., September 9, 2003 (#5) - GeorgeOJungle
I hate to say this but I cringed when I realized how small the sample size was. I may have misread it, but it appeared that the reviewers were the ones who attempted to do the hypothesis tests. Not a good situation.

Posted 1:48 p.m., September 9, 2003 (#6) - reno dakota
Sorry-- I'd like to figure the alpha for the Skiena formula. I've also never seen the Bill James log5 formula, so if anyone has data on that I'd be interested. Another question: What do people think about using BP's third-order win% in lieu of actual win% for these formulae?

Posted 1:50 p.m., September 9, 2003 (#7) - reno dakota(e-mail)
To clarify: I am assuming the alpha for Skiena's formula is different in baseball than in jai alai. I'd like to apply Skiena's formula to baseball and see how well it holds up. Guidance on how to calculate that alpha would be much appreciated.

Posted 3:31 p.m., September 9, 2003 (#8) - tangotiger
Keith Woolner has some good data at BP from a few years ago. Do a search for "Lumina", and you should get it.

Posted 3:35 p.m., September 9, 2003 (#9) - tangotiger (homepage)
Actually, the data you want can be found at the above link from diamond-mind.com.

Posted 6:32 p.m., September 9, 2003 (#10) - Alan Jordan
Reno -

David Massey has fairly up to date game by game data for 2003 at

http://www.masseyratings.com/data/mlb.gms

Without having read the book, I can't tell you exactly how Skiena derived his alpha. However, when I do this sort of work I generally use nonlinear least squares. You can also use reweighted nonlinear least squares or, if you have the programming skill and the likelihood function at hand, maximum likelihood. I have SAS, so I can use nonlinear least squares in proc nlin. If you don't have SAS or SPSS or some stat package that can do nonlinear equations, then you have to know how to program it.
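
For anyone without a stat package, here is a rough sketch of the least-squares idea in plain Python. Big caveat: the thread never spells out Skiena's formula, so the model form in `skiena_style_prob` is an assumption recalled from outside this discussion, and the function names and the shape of the `games` list are hypothetical. With only one parameter to fit, a golden-section search over the sum of squared errors stands in for a full routine like proc nlin.

```python
import math

def skiena_style_prob(wa, wb, alpha):
    """ASSUMED model form (not given in this thread): the win% gap,
    raised to alpha, pushes the probability away from .500."""
    gap = abs(wa - wb) ** alpha
    return (1 + gap) / 2 if wa >= wb else (1 - gap) / 2

def sse(alpha, games):
    """Sum of squared errors over games.
    games: list of (win%_A, win%_B, a_won) with a_won equal to 1 or 0."""
    return sum((a_won - skiena_style_prob(wa, wb, alpha)) ** 2
               for wa, wb, a_won in games)

def fit_alpha(games, lo=0.05, hi=5.0, tol=1e-6):
    """Golden-section search for the alpha that minimizes sse.
    Assumes sse has a single minimum on [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    c = hi - inv_phi * (hi - lo)
    d = lo + inv_phi * (hi - lo)
    while hi - lo > tol:
        if sse(c, games) < sse(d, games):
            hi = d          # minimum lies in [lo, d]
        else:
            lo = c          # minimum lies in [c, hi]
        c = hi - inv_phi * (hi - lo)
        d = lo + inv_phi * (hi - lo)
    return (lo + hi) / 2

# Made-up data: a .600 team beats a .500 team 55 times out of 100.
games = [(0.6, 0.5, 1)] * 55 + [(0.6, 0.5, 0)] * 45
alpha_hat = fit_alpha(games)
```

In practice the `games` list would be built from game-level results paired with each team's win%; parsing any particular data file is left out because its layout isn't described here.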

As an aside, it would appear that Skiena's formula needs to be generalized to accommodate other factors such as parks, homefield advantage and league average, not to mention multinomial outcomes.

As for BP's 3rd order win%, I wouldn't rely too heavily on anything that uses aggregated season total runs or events such as hits because strength of schedule and park factors can't be removed. Davenport doesn't explain what he's doing on 3rd order. He explains 1st and 2nd order and they are definitely season aggregated.

I pretty much slammed Davenport's system in a thread called "Tigers winning percentage inflated?" at fanhome.com. In retrospect I was probably a little too harsh on him, but I was appalled that someone who I thought had access to game level data wasn't taking advantage of it. Rob Neyer isn't any better and I KNOW he has access to game level data, but he still uses the pyth with seasonally aggregated data. All the others I have seen that use event data such as hits seem to use seasonally aggregated data. The problem appears to be that there is no current, up-to-date source of game level data with hits that people can use to build better models. Once such a source becomes available, Davenport's 1st and 2nd order winning percentage models will be truly obsolete.
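
For reference, a minimal sketch of the season-aggregated "pyth" estimate, assuming the classic form with an exponent of 2. The thread doesn't say which exponent Neyer or Davenport actually use, and the function name and the run totals are made up for illustration.

```python
def pythag_win_pct(runs_scored, runs_allowed, exponent=2.0):
    """Pythagorean win% from season totals.
    exponent=2.0 is the classic choice; other values are an open assumption."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)

# Hypothetical season totals: 800 runs scored, 700 runs allowed.
estimate = pythag_win_pct(800, 700)
```

Note that only two season totals go in, which is exactly the aggregation being complained about: nothing about schedule, park, or game-by-game results survives.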

Posted 1:30 a.m., September 15, 2003 (#11) - Brad
You didn't 'slam' his 3rd order win percentages; you were publicly humiliated. Davenport's goal is purely predictive, making aggregation perfectly acceptable, since hitting events are, in general, relatively independent of one another.

Posted 1:08 p.m., September 16, 2003 (#12) - David H
Does anyone have any ideas on what exponent to use in the Jai-Alai formula, if .4 doesn't provide appropriate predictive winning percentages for baseball?

Posted 5:25 p.m., September 17, 2003 (#13) - tangotiger
That Jai Alai equation makes no sense from what I'm looking at.

Plug in a .600 team against a .500 team, and the result SHOULD be .600, but it is nowhere close to that.

As it stands now, the best method to use is the Odds Ratio method. Maybe I'll put up a Javascript program so that people can use it.
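
In the meantime, a minimal Python sketch of the Odds Ratio method, assuming each win% is measured against a .500 league and is strictly between 0 and 1. The function name is just for illustration.

```python
def odds_ratio_win_prob(p_a, p_b):
    """Chance that team A beats team B, given each team's win%
    against a .500 league. Assumes 0 < p_a < 1 and 0 < p_b < 1."""
    a_wins = p_a * (1 - p_b)      # A plays to form, B does not
    b_wins = p_b * (1 - p_a)      # B plays to form, A does not
    return a_wins / (a_wins + b_wins)

# The sanity check from above: a .600 team against a .500 team.
check = odds_ratio_win_prob(0.600, 0.500)
```

Against a .500 opponent the two (1 - p) terms work out so that the result is just the team's own win%, which is the property the Jai Alai equation fails.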