Individual Poster Page



Golf - player of the year (November 10, 2003)

Discussion Thread

Posted 9:40 p.m., December 9, 2003 (#9) - AED
  The scoring average used by the PGA adjusts for the mean score of each round, but I'm fairly sure it does not adjust for the level of difficulty. By and large this doesn't matter, except for limited-field events like the Mercedes, Tour Championship, and WGC events (or the leftover events like the Tucson Open that are played on the same weeks). Because Woods preferentially plays in the elite events, his actual skill is slightly better than is indicated by the scoring average (by about 0.15 strokes per round).

The "replacement level" would be around $500,000, if you define replacement level as earning enough to keep your card (top 125).



Win and Loss Advancements (November 13, 2003)

Discussion Thread

Posted 11:00 p.m., December 9, 2003 (#47) - AED
  I think it's an issue with the definition of 'replacement level' in win shares. A player who is a replacement-level hitter probably only plays in the majors if he's well above a replacement-level fielder. Likewise, a replacement-level fielder is probably a decent hitter. So a true replacement-level position player is better than someone who is both a replacement-level batter and a replacement-level fielder. Since Win Shares are calculated separately for batting and fielding using the replacement level of each, the values of position players are inflated.


Baseball Player Values (November 22, 2003)

Discussion Thread

Posted 9:17 p.m., February 1, 2004 (#27) - AED
  It means there is an error of 0.0002 wins per at-bat in favor of pitching. Hardly a serious implication.


Marcel, The Monkey, Forecasting System (December 1, 2003)

Discussion Thread

Posted 6:14 p.m., December 5, 2003 (#33) - AED
  I think the intention here is to make the system as simple as possible. As such, Tango has done a great job.

The variance measurement for batting averages should use the average batting average, not the per-season batting averages, because random variations significantly outweigh actual changes in ability (aside from the basic age adjustment). The beauty of this is that the same BA*(1-BA) factor appears in all variances, and thus divides out to give exactly
average = sum(AB*BA) / sum(AB).

Strictly speaking, the weighting factors for the various years should themselves be functions of the number of at bats, because the unmodeled changes are just an additional source of variance: the year-to-year fluctuation in batting average is a combination of random noise, changes due to average aging, and unmodeled (effectively random) changes. If the random noise is small (many at bats), the variance due to the unmodeled changes is proportionally quite large; if it is large (few at bats), the variance due to unmodeled changes is negligible. In general, the weight for one year's stats would equal:
AB/(1+x*AB*dy)
where AB is the number of at bats, x is related to the year-to-year variance of the skill in question, and dy is the number of years between the year being projected and the year whose stats are being looked at. This is something to worry about in a more complex system, of course; Marcel works fine because the most problematic situation (a player who doesn't have many at bats in any season) is mitigated by the use of the prior.
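As a minimal sketch in Python of the weighting just described (the seasons and the value of x are made up for illustration, not Marcel's actual constants):

# Sketch of the AB/(1 + x*AB*dy) weighting described above.
seasons = [
    {"dy": 1, "AB": 550, "BA": 0.290},   # most recent season
    {"dy": 2, "AB": 480, "BA": 0.270},
    {"dy": 3, "AB": 300, "BA": 0.310},
]
x = 0.002   # assumed scale of the unmodeled year-to-year variance

weights = [s["AB"] / (1.0 + x * s["AB"] * s["dy"]) for s in seasons]
projection = sum(w * s["BA"] for w, s in zip(weights, seasons)) / sum(weights)
print([round(w, 1) for w in weights], round(projection, 3))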

The regression of players in the way used here can be very accurate. If player abilities and random errors are distributed normally, in fact, weighting in this way is exactly the same as making a probability analysis. Since the distributions are moderately close to Gaussian, you won't improve the projections noticeably using a more formal analysis.

The one catch is that, by regressing to the league average, Marcel only works correctly for starters. Overall, MLB players regress to the average batting average of MLB players, not to the league average. This is significant, because there are more players with 100-200 at bats (averaging 0.240) than there are with over 500 at bats (averaging 0.285). Because the regression effect is most significant for players with few at bats, I would strongly suggest changing this factor. At minimum, you should determine the relationship between batting average and number of at-bats per season, and use the appropriate value to regress. That makes things a little more complicated, but will greatly increase the accuracy of projections of part-time players.


Marcel, The Monkey, Forecasting System (December 1, 2003)

Discussion Thread

Posted 4:18 a.m., December 7, 2003 (#39) - AED
  "Where do you get this: AB/(1+x*AB*dy)?"

Alan, I assume that year-to-year variations of a player's true ability, after making average age corrections, are a random walk. So the variance between the 2003 and 2004 abilities will equal x, that between 2002 and 2004 abilities equals 2x, and that between 2001 and 2004 abilities equals 3x. I also multiply the variance by r(1-r), where r is the rate in question, which makes things easier and intuitively makes smaller year-to-year variations when r is close to zero or one.

Putting this together with the variance from the binomial distribution, I get a total random variance of
r*(1-r)/AB + r*(1-r)*x*dy = r*(1-r)/AB * ( 1 + x*AB*dy) ,
where the first term is from the binomial distribution and the second is the random walk in unmodeled year-to-year variations. If there is no prior, the r*(1-r) factor cancels out since it is present in all weights, and the individual terms are weighted by
AB/(1+x*AB*dy).

If there is a prior and you want to keep Marcel simple, you need to fudge by assuming some value of r*(1-r) in the weighting. I'm also fudging the corrections for average aging, of course, which technically should be done on the model value rather than an adjustment to the stats. (Doing the latter screws up the variances ever so slightly.) If you're trying to project stats as accurately as possible to run a major league team, you'd worry about this. If you want a reasonably accurate forecasting system that can be explained in a few sentences, you don't.

"Can you expand on this too?"

"If player abilities and random errors are distributed normally, in fact, weighting in this way is exactly the same as making a probability analysis."

This is pretty straightforward. Paraphrasing Bayes' theorem, the probability of a player's true (inherent) batting average skill being x equals the probability of his having had his particular batting stats over the past N seasons times the probability of any player like him having a skill of x. If both of these are approximated using Gaussian probability distributions, -2*ln(P) equals:
(x-m)^2/s^2 + sum(i=years) (x-xi)^2/Vi
where m is the mean of "players like him", s is the standard deviation of that group, xi is the player's rate stat (adjusting for average aging) in year i, and Vi is the random variance (Binomial plus random walk aging) in year i. This, of course, simplifies to
x^2*(1/s^2+sum_i(1/Vi)) - 2x*(m/s^2 + sum_i(xi/Vi)) + constants
Solving for the value of x that maximizes the probability (or minimizes this function of -2*ln(P)), you get:
x = (m/s^2 + sum_i(xi/Vi)) / (1/s^2+sum_i(1/Vi))
which is *exactly* a weighted average of the player's past stats and a regression factor. The weights equal 1/Vi for the past stats and 1/s^2 for the regression to the mean.
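In Python, the closed form above is just a few lines (the values of m, s, and the per-season rates and variances below are made up for illustration):

# Posterior mean x = (m/s^2 + sum(xi/Vi)) / (1/s^2 + sum(1/Vi)), as derived above.
m, s = 0.270, 0.030                  # mean and sd of "players like him"
xi = [0.290, 0.265, 0.300]           # age-adjusted rates by season
Vi = [0.00045, 0.00050, 0.00080]     # random variance (binomial + random walk) per season

num = m / s ** 2 + sum(r / v for r, v in zip(xi, Vi))
den = 1.0 / s ** 2 + sum(1.0 / v for v in Vi)
print(round(num / den, 4))           # weighted average of past stats and the regression term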


Marcel, The Monkey, Forecasting System (December 1, 2003)

Discussion Thread

Posted 10:27 p.m., December 8, 2003 (#45) - AED
  Tango, the rr/(1+rr) formula is the same as the one I posted on the DRA part 3 thread. rr equals the variance of the player ability distribution divided by the variance from random noise, the latter being proportional to the number of chances.

I tend to think in terms of probabilities, so it doesn't really matter if you use rates, ratios, ln(ratios), or any arbitrary function of the rate. The only difference is that calculations are easier in some scales than others. In this case, I think that it's easiest to stick with the rates, since the binomial statistics dominate the variance, and variance is trivial to calculate if your scale is a rate.


Marcel, The Monkey, Forecasting System (December 1, 2003)

Discussion Thread

Posted 10:52 p.m., December 8, 2003 (#46) - AED
  Alan, about your earlier post... I think the difference is that I'm modeling the autoregression component as a combination of average aging trends and an unmodeled random walk. I don't attempt to model the random walk element, instead treating it as an additional source of variance equal to x*dy in the performance for the year in question (as viewed from the year being forecast). I don't think you would measurably improve the forecast accuracy by using a more sophisticated approach, since it is a small factor compared with the random noise. Also, by modeling in this way, I can treat the yearly stats as being statistically independent.


Marcel, The Monkey, Forecasting System (December 1, 2003)

Discussion Thread

Posted 12:10 p.m., December 10, 2003 (#48) - AED
  Alan, this is what the x*dy term accomplishes. Recall that my random variance is set equal to r*(1-r)*(1/AB+x*dy), where again r is the "true score" and x is an arbitrary constant. (Actually I use something a little more complex than x*dy to make that term a little more accurate, but that's not the point here and x*dy is a pretty good approximation.) The other element of the correlation coefficient, the variance in inherent abilities, is a constant. The correlation coefficient, of course, is related to the ratio of the two variances, so it is indeed a function of dy.

There are several ways you can handle the element I model as x*dy. One, you can ignore it altogether, which as you note is quite unrealistic. Two, you can actually model it as some sort of random walk whose year-to-year variance equals x. Three, you can model the dy-year variance term as I do (x*dy), which correctly models the variance but doesn't attempt to model the walk itself. In other words, I just care that the variance between 2000 and 2004 is 4x and that between 2001 and 2004 is 3x; I don't take advantage of the additional knowledge that the variance between 2000 and 2001 is only x. Because the variance from random noise is much larger than x, I don't mind the tradeoff. I could be wrong though; I'll have to take a look at it.


Marcel, The Monkey, Forecasting System (December 1, 2003)

Discussion Thread

Posted 3:24 p.m., December 10, 2003 (#50) - AED
  I didn't enter the 2003 forecasting contest, but running the numbers my average "accuracy" for the 28 players (12*relative OPS error or relative ERA error) was 0.623 using a weighted average with a career baseline and 0.633 using a PECOTA-like comparable player test with a 5-year baseline. Part of the increased accuracy is probably because I include park and league adjustment factors, although I didn't pay attention to see how many players had changed teams. Using a longer baseline also helps the accuracy. On the flip side, I didn't regress to the mean in those projections, which probably adds enough inaccuracy to offset the accuracy gained from the park/league adjustments and extended baseline.

Regardless, my projection isn't significantly more accurate than Marcel, since the dominant source of projection inaccuracy is random noise in the 2003 stats (about +/-0.054 in OPS for 600 PA; about +/-0.63 in ERA for 200 IP). I think a perfect projection, with all abilities known exactly, would have had an average accuracy of around 0.55.


Professor who developed one of computer models for BCS speaks (December 11, 2003)

Discussion Thread

Posted 5:37 p.m., December 11, 2003 (#14) - AED (homepage)
  The BCS has not done a very good job of selecting computer rankings on the basis of merit. The fact that a win over a I-AA team is ignored, while a win against a weak I-A team can lower your ranking is absurd. That wouldn't really have mattered here, though, since better computer rankings also put Oklahoma #1, LSU #2, and USC #3.

Only three of the ranking systems use statistics in any meaningful way. Sagarin's system is based on the Elo chess system, in that it finds the set of team ratings such that every team's record is "predicted" properly. (The cumulative odds of winning each game equal the team's number of wins.) This is similar to the KRACH system. Massey's and Wolfe's ranking systems are based on more accurate maximum likelihood models. All regress to the opponent strength rather than to the mean. This is somewhat problematic in that a win over a really bad team will lower your rating, albeit only by a tiny amount. Also, Massey and Wolfe do not consider home field, so they overrank Ohio State (8 home games, 4 road games).

I've spent several years working on this sort of stuff; my homepage link has details on correct Bayesian treatments, both with and without margin of victory considerations.


Professor who developed one of computer models for BCS speaks (December 11, 2003)

Discussion Thread

Posted 2:09 a.m., December 12, 2003 (#17) - AED
  Alan, I think you're mixing up the details of Massey's two rankings. While he uses a fairly sophisticated homefield treatment in his main ranking, he does not use homefield effects at all in his BCS ranking.

The variance in score difference is proportional to the sum of the scores - something that is true in every sport and allows better rankings than what one gets from Pythagorean (or related) systems. From the link Alan provided, it seems that Massey's game outcome function assumes the variance is proportional to the square root of the sum of the scores (the standard deviation goes as the fourth root), which doesn't sound right.

As for Colley's system: each team's rating equals its winning percentage plus the average of its opponents' ratings minus 0.5, with all teams additionally considered to have won and lost against an average team (rating=0.5). It would be fairly trivial to build a homefield factor into this by adding another linear equation in which the homefield factor is set equal to the winning percentage of the home team plus the rating difference between road and home teams minus 0.5. It's probably not worth the effort; it's sort of like arguing about 1.6OPS vs. 1.7OPS when linear weights is vastly better than either.
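For what it's worth, the linear system just described is small enough to set up and invert in a few lines of Python (using numpy; the schedule below is made up, and a homefield term would simply add one more unknown and one more equation of the same form):

import numpy as np

# Colley-style system as described above: each game adds to both teams'
# diagonal entries and subtracts from the off-diagonal entries, and every team
# starts with a phantom win and loss against a 0.5-rated team.
games = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")]   # (winner, loser)
teams = sorted({t for g in games for t in g})
idx = {t: i for i, t in enumerate(teams)}

n = len(teams)
A = np.zeros((n, n))
b = np.ones(n)             # 1 + (wins - losses)/2, built up below
np.fill_diagonal(A, 2.0)   # the two phantom games
for w, l in games:
    A[idx[w], idx[w]] += 1.0
    A[idx[l], idx[l]] += 1.0
    A[idx[w], idx[l]] -= 1.0
    A[idx[l], idx[w]] -= 1.0
    b[idx[w]] += 0.5
    b[idx[l]] -= 0.5

ratings = np.linalg.solve(A, b)
print(dict(zip(teams, ratings.round(3))))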


Professor who developed one of computer models for BCS speaks (December 11, 2003)

Discussion Thread

Posted 10:08 a.m., December 12, 2003 (#20) - AED
  Alan, the process I described for adding homefield advantage to the Colley Matrix would just add one more variable and one more linear equation. So yes, it could still be solved by matrix inversion.

Tango, there is no way to build an effective probabilistic win-loss rating without using a prior of some kind. The odds of all results this season having happened are maximized when all undefeated teams have ratings of infinity (and winless teams have ratings of negative infinity), since only then are the probabilities of those teams' wins (losses) maximized at exactly one. A team's games against itself don't change anything, since those probabilities are constant as ratings are changed. (By crediting a team with a win and a loss against itself, you multiply the total season probability by 1/4, regardless of the team's rating.) Likewise, a team whose only loss was to an unbeaten would gravitate to a rating of 'half infinity', which gives 100% probability of its wins and losses having happened.

Failing to use a prior also violates Bayes' theorem, in that it ignores the fact that it would be quite unlikely for a team to score 1000 on a rating system that has never created a team rating over 10.


Professor who developed one of computer models for BCS speaks (December 11, 2003)

Discussion Thread

Posted 12:15 p.m., December 12, 2003 (#25) - AED
  Jim, the BCS includes only 'retrodictive' rankings. Predictive rankings are more accurate in terms of predicting outcomes of future games, but the best ones rely exclusively on game scores (no win-loss bonus) and would give #1 vs. #2 pairings that are unacceptable to the general public. For example, the 2002 title game would have been Kansas State vs. USC, despite a couple of losses by both teams. Sagarin and Massey both have predictive rankings, but their rankings used by the BCS are entirely retrodictive.

The schedule strength issue is a thorny one. A real-life example was Alabama a few years back, which was ranked #3 in the preseason and went on to have a dismal season. Their first opponent was a Pac-10 team, I think USC, and the poll voters gave USC a ton of credit for beating Alabama. The final computer rankings were less impressed, because they evaluated Alabama as they really played rather than how good they were perceived to be at the time of the game. So yes, you really do have to use all games an opponent plays - including those played later - to make an accurate statistical evaluation.


Professor who developed one of computer models for BCS speaks (December 11, 2003)

Discussion Thread

Posted 12:33 p.m., December 12, 2003 (#26) - AED
  Tango, assuming you are asking about a KRACH-like system, the basic idea is to find the set of team ratings such that each team's expected number of wins equals its actual number of wins. The expected number of wins is the sum of the probabilities of winning each game. In other words, if your "trial" rating has team A one sigma better than team B, team A gets 0.84 expected wins and team B gets 0.16 expected wins.

The infinity problem comes in if a team is unbeaten, in which case you need to find the set of ratings such that its probability of winning each game equals one. Because even a 10-sigma mismatch has a nonzero chance of an upset, one only finds a win probability of exactly one for all of a team's games if its rating is infinitely better than that of any of its opponents.

Having that team play itself to a tie N times (or winning N/2 times and losing N/2 times) doesn't affect this, since you're adding N/2 to both sides of the equation (the probability of a team beating itself exactly equals 1/2). Thus you would still need the probabilities of winning each of its real games to all equal one, which again is only the case if it is infinitely better than its opponents.

Instead you have to do one of the following. The KRACH system adds N (N=1?) ties against an average opponent to the team's rating. Sagarin's system appears to credit a win as slightly less than one win (perhaps 0.97 wins) and a loss as slightly more than zero wins (around 0.03), or something to this effect. This is equivalent to using a prior in a probabilistic model, and is equivalent to regression to the mean.
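A rough sketch of that kind of system in Python, assuming a normal (probit) win probability and a single phantom tie against an average opponent; the schedule, step size, and iteration count are all illustrative:

import math

def p_win(diff):
    # win probability under a normal model, ratings in sigma units
    return 0.5 * (1.0 + math.erf(diff / math.sqrt(2.0)))

games = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "D"), ("D", "C")]   # (winner, loser)
teams = sorted({t for g in games for t in g})
wins = {t: sum(1 for w, _ in games if w == t) for t in teams}
opponents = {t: [l if w == t else w for w, l in games if t in (w, l)] for t in teams}
r = {t: 0.0 for t in teams}

for _ in range(2000):   # damped iteration toward "expected wins = actual wins"
    for t in teams:
        expected = sum(p_win(r[t] - r[o]) for o in opponents[t]) + p_win(r[t])  # plus phantom tie vs. 0.0
        r[t] += 0.05 * (wins[t] + 0.5 - expected)
print({t: round(v, 2) for t, v in r.items()})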

The main drawback to this sort of model is that it ignores the details of which teams you beat and which you lost to. All it tries to do is match the actual and expected wins, but the specific details of which games were won and lost give additional information about a team's quality.


Diamond Mind Baseball - Gold Glove Winners (December 11, 2003)

Discussion Thread

Posted 2:25 a.m., December 12, 2003 (#8) - AED
  supposedly this works...


Pythag - Ben VL (December 12, 2003)

Discussion Thread

Posted 5:04 p.m., December 17, 2003 (#6) - AED
  FJM, your "IF AND ONLY IF" assertion is far too strong. Regular season stats have very good predictive value regardless of whether or not pitching matchups are considered. Sure, one can improve the predictions by accounting for pitching. One can improve them further by considering team platoon splits, health of position players, home field advantage, or any number of other considerations. Given that we don't have lineup and pitcher data for most of baseball's history, it would be sad to forgo analysis altogether because the data aren't up to one's standards.



Converted OBA (December 15, 2003)

Discussion Thread

Posted 4:08 p.m., December 15, 2003 (#1) - AED
  Tango, this doesn't quite work out the way you think. The problem is that the variance of your effective OBA does not equal effOBA*(1-effOBA)/PA, which is necessary to apply binomial statistics.

Consider two extreme examples that can be treated with binomial statistics. Player #1 never swings the bat, so his at-bat results are either walks or strikeouts (or HBP, but we're not counting those). Letting "x" equal the fraction of plate appearances in which he walks, the variance in the walk rate equals x*(1-x)/PA. Since a walk is given a weight of 0.72 in your system, the player's effOBA equals 0.72*x, which means the variance of effective OBA equals effOBA*(0.72-effOBA)/PA.

The opposite extreme example is a player who swings at every pitch, and either whiffs or hits a home run. Again, the variance in his home run rate equals x*(1-x)/PA. Adopting the weight of 1.95 for a home run, the variance in his effective OBA equals effOBA*(1.95-effOBA)/PA.

Giving both players typical effOBA values of 0.340, the variance in the walker's effOBA is a factor of four smaller than the variance in the slugger's effOBA. Obviously the example is an extreme case that would never occur in reality, but it does illustrate the point that the variance will be higher for power hitters than for singles hitters, which violates the necessary requirements for use in a binomial distribution. (Another more straightforward problem is that there is a nonzero chance that a player's effOBA could exceed one, which would really cause problems with the statistics.)
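A quick numeric check of the factor-of-four claim, using the 0.72 and 1.95 weights from above:

eff = 0.340
var_walker = eff * (0.72 - eff)    # per-PA variance for the all-walks player
var_slugger = eff * (1.95 - eff)   # per-PA variance for the all-homers player
print(round(var_walker, 3), round(var_slugger, 3), round(var_slugger / var_walker, 1))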

Something along these lines can be used to improve EqA, although of course one might as well do it right if making the effort to compute EqA.


Converted OBA (December 15, 2003)

Discussion Thread

Posted 6:42 p.m., December 15, 2003 (#6) - AED
  Here are some examples. I've used a Monte Carlo test to compute the variances per AB+UIBB, so it's imperfect but should be pretty close. I don't have ROE data handy, so I've limited the test to the other categories (singles, doubles, triples, home runs, walks).

effOBA  variance  smp.var  player
0.495   0.388     0.250    Bonds
0.324   0.201     0.219    Pierre
0.279   0.260     0.201    T Batista
0.162   0.136     0.136    average pitcher
"variance" denotes the actual variance in effOBA using last year's stats as the model; "smp.var" denotes the variance one would assume using binomial statistics and effOBA as the rate. Variances are per plate appearance, so the variance in N plate appearances equals the value divided by the number of appearances.

As you see, the relationship between actual and assumed variances is not a simple function of effOBA, but depends on how much of the player's contribution to effOBA is a function of extra base hits.


Converted OBA (December 15, 2003)

Discussion Thread

Posted 6:56 p.m., December 15, 2003 (#7) - AED
  I seem to have interpreted your intention of using this number with "probability distributions" literally. In the way your example would use this - an application that involves only the mean, not the variance - I suppose it should work out OK.


Converted OBA (December 15, 2003)

Discussion Thread

Posted 8:32 p.m., December 16, 2003 (#15) - AED
  The reason your finding looks odd is that no MLB pitcher who faced 600+ batters last season gave up twice the league average number of home runs. Helling was the only pitcher in 2003 who faced 600+ batters and gave up more than 150% the league average number of home runs per batter faced.

A more detailed ability analysis appears to confirm the result from the "Ben matchup" system of around 60 HR in 600 PA. In a nutshell, you assume that the odds of a batter with home run ability of "X" hitting a home run off a pitcher with home run prevention ability of "Y" equals:
max * (1+erf(X-Y))/2
where max is the maximum rate for a great slugger and a horrid pitcher, and erf(x) is the error function as defined in C; (1+erf(x))/2 is the normalized probability from negative infinity to x. X and Y are computed from the observed rates, accounting for the distribution of opposing pitching/batting skills faced, park effects, and so on. (I'm leaving out a lot, of course, since it's sort of off-topic.)
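As a hedged sketch (the values of max, X, and Y below are placeholders rather than fitted quantities), the matchup rate above is a one-liner in Python:

import math

def matchup_hr_rate(X, Y, max_rate=0.12):
    # rate = max * (1 + erf(X - Y)) / 2, following the formula above
    return max_rate * (1.0 + math.erf(X - Y)) / 2.0

print(round(matchup_hr_rate(X=0.8, Y=-0.5), 4))   # strong slugger vs. weak HR-prevention pitcher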

The problem that FJM brought up arises because "Ben's method" doesn't handle the rates correctly. You actually need to convert the line to rates, combine the rates, and convert back to a line. In other words, all plate appearances can be divided into walks, strikeouts, and batted balls, so there are three rates there (BB/PA, SO/PA, batted balls/PA). Apply Ben's method to the three rates, normalize to one, and distribute the 600 PA among the three categories. Batted balls can be broken down into home runs, outs, and non-HR hits (or a two-step process involving the HR rate and then the non-HR hit rate). Again, apply Ben's method to the three rates, normalize to one, and distribute the batted balls among the three categories. Finally, for non-HR hits, divide among singles, doubles, and triples (or a two-step process with the XBH rate and then the triple/double ratio).

I don't have time to run through it now using the original example and FJM's example, but this will fix any problem. I should note that the batting average from the matchup need not equal the league batting average, even if the batter and pitcher both have the league averages.


Do Win Shares undervalue pitching? (December 15, 2003)

Discussion Thread

Posted 7:09 p.m., December 15, 2003 (#4) - AED
  It really has nothing to do with either 1.52/0.52 or the Fibonacci values. The real question is how many runs a replacement-level player would cost the team at each position. Consider baseball vs. slow-pitch softball. Assuming that pitching is important in baseball and unimportant in slow-pitch softball, you would find that (compared with baseball) "win shares" for a slow-pitch softball league should be awarded more heavily for offense than defense -- say, 1.30/0.30.

The overall assignment of Win Shares on offense and defense seems reasonable, but the problem is that batting and fielding replacement levels are not independent. A player who is both a replacement-level fielder and a replacement-level hitter would be below replacement level overall. Actual replacement level is only slightly worse than replacement-level hitting combined with average fielding.


Do Win Shares undervalue pitching? (December 15, 2003)

Discussion Thread

Posted 3:38 p.m., December 16, 2003 (#21) - AED
  Tango, you are implicitly assuming that a team with average offense and replacement-level defense would have the same record as one with replacement-level offense and average defense. I don't see this as a given. In my slow-pitch softball example, replacement-level defense would cost fewer wins than replacement-level defense does in baseball. Requiring the two numbers to be reciprocals of each other is only valid if you believe that replacement-level defense is as bad as replacement-level offense.

I disagree that 0.62/1.62 (or 0.52/1.52, 0.50/1.50, etc.) is a reasonable definition of replacement level. Replacement level is any two values that would produce a Pythagorean (^2, ^1.8, or whatever you prefer) win percentage of 0.30. If you believe that replacement-level offense is equally bad as replacement-level defense, you end up with something around 0.80/1.25. If you're playing home run derby (all hitting), it's 0.62/1.00. Regardless, the problem here is that you have to decide if you're trying to compute marginal wins relative to a zero-level team or relative to actual replacement level.

Fundamentally, I think it's pointless to try to quantify specific replacement levels for fielding, pitching, and hitting. A replacement-level player is one whose overall contribution is at the replacement level, not whose fielding, batting, or baserunning skills specifically are at replacement level. Unless the correlation between fielding skill and hitting skill is exactly one, the overall replacement level at a position is better than the combination of replacement-level hitting and fielding.


Do Win Shares undervalue pitching? (December 15, 2003)

Discussion Thread

Posted 6:49 p.m., December 16, 2003 (#24) - AED
  Yes, there's a big difference between whether you are measuring contributions above replacement level or contributions above the Bill James baseline. Replacement level is something that can be easily determined by averaging the typical contributions from replacement-level players. More to the point, replacement level actually means something in terms of a player's value. If a player is below replacement level, he has zero value since you can find someone better to replace him without too much difficulty.

So I'm not sure why one would prefer Win Shares instead of value over replacement, but back to that topic. The conversion from Pythagorean winning percentage to run differential is merely a first-order Taylor series approximation:
N*(s^2)/(s^2+a^2) = (s-a+N*m)/(2*m) = (s-N*bs)/(2*m) + (N*ba-a)/(2*m)
where N is number of games, s is runs scored, a is runs allowed, m is the league average runs per game, and bs and ba are the baseline values for runs scored and allowed. This approximation is only valid where the winning percentage is between 0.3 and 0.7, not over the entire range of possible values. The baseline values of ba and bs are set where the winning percentage is zero in the approximation but much higher than zero (about 15%) in the Pythagorean equation. Therefore, using the Pythagorean equation to find the baseline values is not correct. Instead, you should look back to the fact that the Taylor series approximation is centered on the league average, where a run scored is as valuable as a run prevented. As such, the most reasonable choice for baseline values is 0.50/1.50.
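A quick numeric check of how close the linear approximation stays to the Pythagorean value over the 0.3-0.7 range (the league average of 4.75 runs per game below is just an illustrative value):

N, m = 162, 4.75   # games and league runs per game
for s_pg, a_pg in [(5.2, 4.4), (4.75, 4.75), (4.2, 5.3)]:
    s, a = s_pg * N, a_pg * N
    pyth = N * s ** 2 / (s ** 2 + a ** 2)
    linear = (s - a + N * m) / (2 * m)
    print(round(pyth, 1), round(linear, 1))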

Adopting Tango's 70/30 split for defense, this would give 50% offense, 35% pitching, and 15% fielding. However, to avoid the "value over replacement" double-counting problem I mentioned before, one might assume that the variance in offense and fielding is largely independent. In other words, the total variance from 50% in one variable and 15% in another is equivalent to 52.2% from a single variable. This is equivalent to giving position players 60% of the win shares and pitchers 40%. Dividing the position players' share back into the right offense vs. fielding breakdown, this means that win shares should be 46% offense, 40% pitching, and 14% fielding. This means baselines of 0.54/1.54 and a pitching/fielding split of 0.74/0.26.


Do Win Shares undervalue pitching? (December 15, 2003)

Discussion Thread

Posted 6:45 a.m., December 17, 2003 (#29) - AED
  Studes, sorry for making a few leaps of logic. I'll step through it.

1. The approximation of Pythagorean wins to a linear function of run differential is centered at 50% wins (zero run differential), so by definition the marginal win value of a run scored equals that of a run prevented.

2. Win Shares per skill (offense, pitching, or fielding) are awarded proportionally to the standard deviation in wins gained or lost because of that skill. This is pretty straightforward, as it means that if you are comparing batters to 2-sigma bad batters to compute their values, you should also be comparing pitchers to 2-sigma bad pitchers and fielders to 2-sigma bad fielders.

Putting 1 and 2 together, one concludes that Bill James, at least, feels that the standard deviations of offense, pitching, and fielding have a ratio of 52:35:17. Since most of offense and fielding are from position players, one can estimate that the ratio of standard deviation of total position player contributions to that for pitchers is 54.7:35, or 6:4. This means that pitchers really should have 40% of the win shares.

Dividing the 60% of win shares given to position players back up into 75% offense and 25% fielding, the final distribution of win shares is 45%, 40%, and 15%. (This works out differently from the previous example since I'm using James' breakdown of pitching/fielding rather than Tango's. For Tango's breakdown the final distribution is 46/40/14.)

This isn't quite right, though, because a standard deviation of 7 in pitching and of 3 in fielding gives a defensive standard deviation of 7.6, rather than 10 (a fact I overlooked in my earlier post). Accepting that the standard deviation of runs scored equals that of runs allowed, let's call the ratio of offense to defense 50:50. If pitching and fielding are statistically independent (mostly true), a 7:3 ratio of pitching:fielding standard deviations would give an offense:pitching:fielding ratio of 50:46:20 in standard deviations. Again assuming that offense and fielding are uncorrelated, this gives a ratio of 54:46 for position players to pitchers; breaking down the position player standard deviation into a 5:2 ratio of offense and fielding would distribute win shares by 39% offense, 46% pitching, and 15% fielding. In other words, the .52/1.52 baseline should become .61/1.61 (albeit for entirely different reasons from Tango's Fibonacci argument), and the average distribution of defensive win shares should be 75% pitching and 25% fielding.

With a win advancement program, you can calculate these values directly as well as check for correlations between team fielding, batting, and pitching. My calculations of this sort using Retrosheet's 1991 and 1992 data indicate that win shares should be distributed something like 37% offense, 45% pitching, and 18% fielding.

I'm not sure which breakdown is better, but it does seem that pitchers deserve much more than the 35% of win shares they are currently allocated -- something closer to 45-46% would be right.

I'm not sure Win Shares is worth spending any more time tweaking, aside from the overall distributions. As Michael and many others have pointed out, there are serious problems throughout, so it's useful only as a nifty toy. If you really want absolute wins, you should compute wins contributed relative to replacement level and then sprinkle the remaining wins among a team's players using 39/46/15, 37/45/18, or some other reasonable offense/pitching/fielding percentage.


Do Win Shares undervalue pitching? (December 15, 2003)

Discussion Thread

Posted 2:28 p.m., December 17, 2003 (#34) - AED
  Right. If you have two independent variables with standard deviations of x and y, the standard deviation of the sum of the variables equals sqrt(x^2+y^2).

So in this case, sqrt(52^2+17^2) = 54.7.


Do Win Shares undervalue pitching? (December 15, 2003)

Discussion Thread

Posted 3:29 p.m., December 17, 2003 (#37) - AED
  Tango, that's not right, unless there is a perfect correlation between hitting, fielding, and baserunning. In reality, the correlations are close to (but not exactly) zero. This means that the hitting, fielding, and baserunning standard deviations should be added in quadrature; this makes a ratio of sqrt(41^2+16^2+4^2) = 44% nonpitchers and 38% pitchers. So pitchers get 46% of win shares and nonpitchers get 54%.


Do Win Shares undervalue pitching? (December 15, 2003)

Discussion Thread

Posted 4:54 p.m., December 17, 2003 (#39) - AED
  Guy, this is what I did in my earlier post (#29). I mentioned that Win Shares seems to be based on the assumption that 52:35:17 is the ratio and ran through the math (to answer Studes' question), but then corrected for the statistical independence of offense and fielding, which is how I got 50:46:20. This results in pitchers getting 46% of the win shares.

In the same post, I also used actual team-level SDs of the factors computed using win advancements to estimate 45% pitching Win Shares and essentially confirm the result obtained from assuming equal offense and defense and a 70/30 pitching/fielding division of defense.



Request for statistical assistance (December 17, 2003)

Discussion Thread

Posted 2:45 a.m., December 20, 2003 (#15) - AED
  Tango - The short answer is that, for your purposes, multiplying the difference from league average by the variance ratio will give a reasonable regression to the mean, so long as all players caught similar numbers of innings. If there are wide differences in number of innings you have to do a little more work.

For the values in post #9, the observed variance is 4 (2^2) and the theoretical random variance is 1 (1^2), so you would regress by 1/4 towards the mean.


Request for statistical assistance (December 17, 2003)

Discussion Thread

Posted 11:18 p.m., December 20, 2003 (#18) - AED
  Here is the easiest way I can think of to do this empirically.

Divide the catchers into groups based on number of plate appearances. For each group, calculate the variance in each statistical category, as well as the random variance. Regress each catcher's value towards the group average (not the overall average) using:
value = player + (average - player) * (random variance) / (actual variance)
This saves a lot of work, and allows you to regress each player to the average of similar catchers.
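In Python, with made-up numbers for one catcher and his plate-appearance group:

player_rate = 0.013        # e.g. wild pitches per PA with this catcher behind the plate
group_average = 0.010      # average rate for catchers with similar playing time
random_variance = 4.0e-6   # variance expected from chance alone
actual_variance = 1.0e-5   # observed variance of rates within the group

value = player_rate + (group_average - player_rate) * random_variance / actual_variance
print(round(value, 4))     # regressed 40% of the way back to the group average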

If you want to do a little more work, you could determine the average and variance ratio as a function of typical innings caught per group. These should be reasonably smooth functions, and you could regress each catcher to his own mean using his own regression amount. The variance ratio (assuming you are working in rates rather than totals) should go something like 1/(1+a*n), where a is a fixed constant and n is the number of innings caught or plate appearances.

All of this assumes that you can calculate the random errors, of course. I assume you are, but want to make sure you take into account the error contributions from the fact that "pitcher with other catcher" has random and systematic offsets from the true desired baseline of "pitcher with average catcher".


Request for statistical assistance (December 17, 2003)

Discussion Thread

Posted 2:27 a.m., December 22, 2003 (#21) - AED
  If you're having difficulty with the variances, I'd suggest approaching it from the opposite direction:
value = average + (player - average) * (intrinsic variance) / (actual variance)
The "actual variance" is the square of the observed standard deviation of the group of players, and the "intrinsic variance" is the variance due to player abilities. I would also suggest keeping this purely in rates (per PA) until the very end, at which time you can multiply by some number of PA per 162 games.

Calculating "intrinsic variance", which I'll abbreviate "ivar" is not too tough. For one pitcher, the variance among rates per catcher equals:
variance = <c/npa> + ivar,
which can be rewritten:
ivar = variance - <c/npa>
"c" is the random variance per PA and equals the pitcher's career rate times (1-pitcher's career rate). In other words, if 1% of plate apperances have wild pitches, c equals 0.01 * 0.99. The average is taken over all catchers to have caught the pitcher for some minimum number of plate appearances (you choose the cutoff -- too low makes for more random noise; too high can cause biases to creep in), and "npa" is the number of plate appearances each catcher had with the pitcher. You calculate the variance directly from the rates those same catchers had while working with the pitcher.

Running through this process for all pitchers to have worked with at least, say, 5 or 10 catchers for significant amounts of time, the equation above becomes:
ivar = < variance - <c/npa> >
The outer average is taken over the pitchers; the inner average is taken over each pitcher's catchers. So run through this process to calculate a single value of ivar for the rate in question.
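A minimal sketch of that calculation in Python (the pitcher/catcher data are made up, so the resulting ivar is only illustrative):

from statistics import mean, pvariance

# ivar = < variance of catcher rates with a pitcher - <c/npa> >, as above,
# with c = career_rate * (1 - career_rate).
pitchers = [
    {"career_rate": 0.010, "catchers": [(0.014, 2000), (0.007, 1800), (0.010, 2500)]},
    {"career_rate": 0.014, "catchers": [(0.018, 1500), (0.011, 2200), (0.015, 1900)]},
]   # catchers: (rate with this pitcher, PA with this pitcher)

ivars = []
for p in pitchers:
    c = p["career_rate"] * (1 - p["career_rate"])
    rates = [r for r, _ in p["catchers"]]
    ivars.append(pvariance(rates) - mean(c / npa for _, npa in p["catchers"]))
print(mean(ivars))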

As noted in my earlier post, you should group catchers by career number of plate appearances. The "average" and "actual variance" of the rates for each group are determined separately, and combined with "ivar" from above will give the group's regression to average.

There is a MUCH more difficult way you could approach this, modeling the variances of the catcher ratings directly, but I doubt you would gain much over this technique (aside from a splitting headache).


UZR, 2000-2003, Adjusted by Difficulty of Position (December 21, 2003)

Discussion Thread

Posted 3:51 a.m., December 22, 2003 (#9) - AED
  Tango, on the other thread about position difficulties you ended up using UZR runs per play, rather than total UZR runs. I think it would be more useful for players in this table to be shown in UZR runs per play instead of per 162 games; otherwise good shortstops and bad first basemen are overrated, while bad shortstops and good first basemen are underrated. I think this is what caused MGL's confusion -- an above-average player should be at a position that gets more plays to leverage his superior skills.


BGIM : Maximum Likelihood Estimation Primer (December 26, 2003)

Discussion Thread

Posted 3:04 a.m., January 3, 2004 (#2) - AED
  If you actually know the true talent distribution, it's pretty simple. The probability of any player having an ability of x is a Gaussian centered on 0.340 with sigma of 0.040. The probability of a player with ability x having 0.440 in 600 PA is proportional to:
x^264 * (1-x)^336
which is roughly a Gaussian centered on 0.440 with sigma of 0.0237. Multiplying the two probabilities together, you get a Gaussian centered on 0.414 with a sigma of 0.020. (Basically you're weighting each factor by 1/sigma^2.)

Your second option will produce a very similar result. In fact, if the Gaussian approximations are correct, the expectation value will exactly equal what you get from #1. If you are dealing with highly non-Gaussian situations, you have to choose whether to take the mean (as you suggest here) of the probability distribution, the median (which I prefer since it's scale-invariant) or the mode (which is MLE).


BGIM : Maximum Likelihood Estimation Primer (December 26, 2003)

Discussion Thread

Posted 2:52 a.m., January 6, 2004 (#6) - AED
  You're right, I messed up. The variance from the binomial approximately equals OBA*(1-OBA)/PA, which gives a standard deviation of 0.0203, not 0.0237.

Putting in the correct number, the most likely true talent is the average weighted by 1/sigma^2, or 0.420.

The standard deviation equals 1/sqrt(1/sigma1^2 + 1/sigma2^2) = 0.018.
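In Python, the inverse-variance weighting with the corrected sigma:

m1, s1 = 0.340, 0.040     # population prior
m2, s2 = 0.440, 0.0203    # observed rate and its binomial standard deviation
w1, w2 = 1 / s1 ** 2, 1 / s2 ** 2
print(round((w1 * m1 + w2 * m2) / (w1 + w2), 3))   # about 0.420
print(round((w1 + w2) ** -0.5, 3))                 # about 0.018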

Keep in mind that you shouldn't do this for every season's worth of stats if you're trying to make projections from 3 years of data.


Valuing Starters and Relievers (December 27, 2003)

Discussion Thread

Posted 5:28 p.m., January 4, 2004 (#65) - AED
  David, the scarcity argument does apply. A replacement level reliever can be picked up off the scrap heap. "Replacement level starter" is a misnomer because he is not freely replaceable (because he has value as an above replacement level reliever).


Valuing Starters and Relievers (December 27, 2003)

Discussion Thread

Posted 1:17 p.m., January 5, 2004 (#67) - AED
  Nobody around here seems to know what Woolner did, so I can't comment on his evidence. I think most of us assume he looked at average ERAs for "#6 starters" and "#8 relievers" (or something like that), defined those as "replacement level", and evaluated those averages as a function of league ERA and came up with his equations. If this is indeed the case, it fails to address the problem that a team's #6 starter is probably its #3 reliever, in which case you are not talking about a replacement-level pitcher but rather someone better than replacement level.


SABR 201 - Issues with MLEs - Why I hate them (December 31, 2003)

Discussion Thread

Posted 5:21 p.m., January 4, 2004 (#52) - AED
  MGL, probably not. It's not really the reverse of hitting, after all. The batting coefficients reflect differences between minor and major league pitching (plus parks and fielding); the pitching coefficients reflect differences between minor and major league hitting (plus parks and fielding).


SABR 201 - Issues with MLEs - Why I hate them (December 31, 2003)

Discussion Thread

Posted 9:56 p.m., January 4, 2004 (#55) - AED
  If the parks and fielding are equivalent, the MLE coefficients will be close to reciprocals for batters and pitchers. There's no reason to assume they are, so it's probably wise to recalculate them for pitchers.


SABR 201 - Issues with MLEs - Why I hate them (December 31, 2003)

Discussion Thread

Posted 5:05 p.m., January 6, 2004 (#66) - AED
  FJM, you don't want to look at correlations to test MLE accuracies, because the varying numbers of at bats mean the correlation will be better for players with many at bats and worse for those with few. Rather, you want to look at whether or not the performances were statistically consistent with the predictions. A player predicted to hit 0.250 but who hits 0.300 in 50 at bats, for example, is consistent with his prediction at better than a 1-sigma level.

I would expect that MLEs are comparable in value to actual major league stats, because the accuracy of any projection from 3 years' data will be limited primarily by the random noise in the year being projected.

The point about selection effects is right, though. Overbay underperformed his MLE at the 0.3 sigma level, which means there is no reason to believe he was any worse than the MLE suggested. The difference between his prediction and performance can be solely attributed to luck. (and not even all that bad luck!)

Kata did the opposite, getting lucky in his first few weeks to overachieve his MLE and stick. By the end of the year, he was most of the way back down to his MLE.


SABR 201 - Issues with MLEs - Why I hate them (December 31, 2003)

Discussion Thread

Posted 12:07 a.m., January 7, 2004 (#68) - AED
  Now, now. Nobody I know uses a certain sigma threshold at which the interpretation suddenly goes from "consistent" to "inconsistent"; obviously it's a gradual scale from better to worse agreement. But I can't imagine any situation in which the first data point agreed with the model at a 0.3 sigma level and my reaction was "goodness, my model's not right!", so I think my statement was correct. Given the number of at bats, he essentially performed as expected, yet lost his starting job because too much emphasis was given to two weeks' worth of stats (when Spivey was injured and Overbay and Kata were battling for one roster spot).


A method for determining the probability that a given team was the true best team in some particular year (January 6, 2004)

Discussion Thread

Posted 3:16 p.m., January 6, 2004 (#3) - AED (homepage)
  I've never understood the use of B-T model for team rankings. Random variations in team performance are Gaussian, so the standard error function should be used instead. I've been publishing computer rankings for quite some time that are based on probable superiority between teams, with the added bonus that I calculate it analytically instead of with a Monte Carlo approach. See homepage link for a fairly comprehensive guide to statistics of rankings (in the "rating details" section).

Note that the probability that a team will win the world series is completely different from the probability that it is the best team. If X is the probability that team A is better than team B after a 162-game season, the odds that team A will beat team B in a 7-game series is roughly 0.5*[(X/0.5)^0.30]. The odds that team A will beat team B in a 5-game series is roughly 0.5*[(X/0.5)^0.26]. Skipping through lots of math, the probability of a team winning the world series is roughly proportional to the square root of its probability of being the best team, perhaps an even lower power.

The way to make it optimally efficient would be to replace the three rounds of playoffs with a best-of-19 series (19 to keep a team's maximum number of games unchanged) between the top teams from each league. Better yet, give each team 5-6 games against every other MLB team during the season to give a balanced schedule, and do away with playoffs altogether.


A method for determining the probability that a given team was the true best team in some particular year (January 6, 2004)

Discussion Thread

Posted 5:51 p.m., January 6, 2004 (#6) - AED
  If team A has a merit value of a, and team B has a merit value of b, the probability that team A will beat team B equals the fraction of the Gaussian error function that is less than a-b (assuming you choose your scale carefully, of course). This function is similar, but not identical, to 1/[1+exp(b-a)], which corresponds to a B-T model (taking ln(m) as the scale on which the merit rankings are measured).

Comparing the two, the B-T model's different shape will affect the determined rankings. This is especially problematic in college sports, where B-T predicts too many major upsets and thus ranks teams too high from beating up on cupcakes, but the difference in shape will produce subtle problems any time it's used. You're right that it's not grossly in error, but since the Gaussian error function is part of the standard C math library, I don't quite see why people opt for a less accurate approach that is also more work to program. Just a pet peeve of mine, I guess.
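To see the difference in shape, here is a small comparison in Python; the logistic curve is rescaled so the two agree for near-even matchups (the sqrt(pi/8) scale factor is just the slope-matching choice, one of several reasonable normalizations):

import math

def gaussian_win(d):
    return 0.5 * (1.0 + math.erf(d / math.sqrt(2.0)))    # cumulative normal

def logistic_win(d, scale=math.sqrt(math.pi / 8.0)):      # B-T-style curve, slope-matched at d = 0
    return 1.0 / (1.0 + math.exp(-d / scale))

for d in (0.5, 1.0, 2.0, 3.0):
    print(d, round(gaussian_win(d), 4), round(logistic_win(d), 4))
# at large rating gaps the logistic leaves noticeably more probability for the upset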

I'm more interested in team-by-team comparisons, since my goal is to create an optimal ranking order. So I calculate the odds that each team is better than each other team. However, it would be trivial to instead calculate the odds that each team is better than all other teams. One correction -- I meant to say "calculate directly"; the calculation is numerical rather than analytical (although it takes negligible time to compute to arbitrarily high precision). It can also be done completely analytically (and more accurately) if you choose to use game scores rather than wins and losses.


A method for determining the probability that a given team was the true best team in some particular year (January 6, 2004)

Discussion Thread

Posted 12:18 a.m., January 7, 2004 (#11) - AED (homepage)
  Alan, actually I have made extensive tests. You can see one such plot on my homepage, in 'predicting games', which shows actual winning percentages as a function of ranking difference, plotted against the prediction from the Gaussian model. In all sports, the B-T model overpredicts the number of major upsets. This may not matter as much in pro sports, where there is more parity, but it indicates that the Gaussian distribution is more accurate in general.

Whether you use the normal distribution or the (cumulative) error function depends on which problem you're answering. The odds of one team beating another would use the cumulative function; the odds of getting a specific game result would be based on the Gaussian probability. I tend to use the term 'error function' as synonymous with the cumulative because that's how it's named in the C library (erf() and erfc()).

Again, there isn't a huge difference between Gaussian and B-T models in pro sports; it's just that the Gaussian *is* easier to program and it's also more accurate. Like I said, a pet peeve.


A method for determining the probability that a given team was the true best team in some particular year (January 6, 2004)

Discussion Thread

Posted 11:29 a.m., January 7, 2004 (#14) - AED
  The line in that graph is not a "fit" to the data, it is the model prediction plotted against the data. The B-T model would not fit it. There are several other tests you can do. For example, using sets of three teams that have played each other, the winning percentages are better fit with the Gaussian than with the B-T. I spent quite a bit of time deciding which model to use when setting up my system, and the Gaussian consistently outperformed the others. I've always assumed the B-T model is common solely because it is similar to the Gaussian but conceptually simpler.

Using a B-T model with a prior is better than using a Gaussian model without a prior. But using a Gaussian model with a prior is better still.

"I don't care because I'm focusing on rankings, not probability estimations per se"

Actually, for accurate rankings you should still use a prior. Otherwise you are prone to overranking teams with easier schedules.


A method for determining the probability that a given team was the true best team in some particular year (January 6, 2004)

Discussion Thread

Posted 4:19 a.m., January 8, 2004 (#20) - AED
  You can't treat the pitcher/team combinations exactly as if they are different teams, since the offense is fairly constant regardless of who is pitching. You would also want to ensure that the prior was not applied separately for each pitcher/team combination, lest the team's offense and fielding be regressed multiple times. But I think if you're going to start looking at who was playing, why not just do the full-bore sabermetric approach by measuring a team's player contributions relative to league average and add them up to give a team ranking? This would be way more accurate than anything done using game scores.

Regarding injuries, I have generally found injuries to be less significant than is commonly thought. If a team loses a key player and starts losing, it is attributed to the injury. If they win without him, it was heart, courage, and/or great coaching. (It's kind of like clutch hitting -- you can always attribute the key hit after the fact but can rarely predict it.) For the season, the Falcons showed about as much game-to-game fluctuation as other teams, and if you really want to read into the weekly performances (at the risk of interpreting noise as signal) you would conclude the turnaround happened during the bye, not at Vick's return.


A method for determining the probability that a given team was the true best team in some particular year (January 6, 2004)

Discussion Thread

Posted 1:29 a.m., January 9, 2004 (#22) - AED
  Actually, the Falcons' first really good game was the second week after the bye. This is why I said it was risking interpreting noise as signal, since I know of no reason why any statistically significant change in performance might have been associated with that game.


A method for determining the probability that a given team was the true best team in some particular year (January 6, 2004)

Discussion Thread

Posted 1:35 a.m., January 10, 2004 (#25) - AED
  Alan, his overall probability equation appears to have been adapted from my own ratings pages, so yes, I think that's a reasonable approach. A few quibbles though.

About the rankings, he fails to account for the distribution of I-AA team strengths. One should instead build in an uncertainty for the I-AA team ratings, a standard deviation of 0.8 or 1.0, and instead of
phi( theta - thetaIAA )
in his 'part 3', it should be
phi[ ( theta - thetaIAA ) / sqrt( 1 + sd^2 ) ]
The effective difference is that, by assuming you know all I-AA teams are identically strong, you penalize teams too much for I-AA losses. Of course, it would actually be better for him to go ahead and rank I-AA, DII, DIII, and NAIA teams so that this isn't necessary. Wolfe's website has full data for all games played.

Also, his prior isn't particularly well chosen. One should actually use the distribution of team strengths rather than arbitrarily assuming that the distribution equals phi(x)*phi(-x).

About the potential modifications... Giving different weights to games in a win-loss system is a recipe for disaster in college football. You don't have enough games to have much constraint as it is. I don't think his suggested way for treating margin of victory would work. However, one can convert the game scores into a sigma (team A played X sigma better than team B) and use a Gaussian probability instead of the cumulative probability using the prescription in my info pages.


A method for determining the probability that a given team was the true best team in some particular year (January 6, 2004)

Discussion Thread

Posted 6:54 p.m., January 10, 2004 (#27) - AED
  As I said, his system isn't all that bad, and appears to be virtually identical to what Wolfe used in 2002 and very similar in design to my own win-loss system. I don't know what SAS is capable of; I do all of my own programming with help from numerical recipes when I need to do minimization.

The prior should be the actual distribution of team ratings from a margin-of-victory rating system. It's close to Gaussian, so in college football you can use a Gaussian with standard deviation 0.9 instead of the single win and single loss against a hypothetical zero team. Better yet, use 1/(1+(x/0.9)^2)^2, which has the same standard deviation but fits the shape of team ranking distributions in all sports pretty well. This requires that you can plug in functions of your own choice within SAS, of course.

A margin of victory ranking can be accurately approximated as a linear problem.


A method for determining the probability that a given team was the true best team in some particular year (January 6, 2004)

Discussion Thread

Posted 6:57 p.m., January 12, 2004 (#29) - AED
  OK. His "prior" equal phi(x)*phi(-x), which by sheer luck matches the spread of IA football teams pretty well. Can this be made any arbitrary power of that, such as [phi(x)^0.8]*[phi(-x)^0.8]? To adapt this system to baseball, you would want to use [phi(x)^31]*[phi(-x)^31]. If powers can be floating point, you should also use the square root of the probabilities for IAA games (or if not, square everything else).

I seriously doubt his technique could be easily adapted for margin of victory, but as noted earlier it can be done easily as a linear system.


BABIP and Speed (January 7, 2004)

Discussion Thread

Posted 3:56 p.m., January 8, 2004 (#10) - AED
  The true variance in skills can be estimated by finding the additional variance needed so that the average of the square of the deviations (in sigma) is one. In other words, find "x" such that:
average [ (RBOE/PA-0.01)^2 / ( 0.0103*0.9897/PA + x^2 ) ] = 1
It turns out that x is 0.0022, closer to Tango's second estimate. This gives a regression toward the mean of 2100/(2100+PA).
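For anyone who wants to reproduce the method, here is a rough sketch on simulated data (the simulated league is made up; only the procedure carries over):

import numpy as np

rng = np.random.default_rng(0)
LEAGUE = 0.0103   # league RBOE rate per PA
TRUE_X = 0.0022   # spread of true skill used to generate the fake league

pa = rng.integers(300, 700, size=5000)
true_rate = np.clip(LEAGUE + rng.normal(0.0, TRUE_X, size=pa.size), 0.0, 1.0)
obs = rng.binomial(pa, true_rate) / pa

def mean_sq_sigma(x):
    # average squared deviation, in units of the total (binomial + skill) variance
    var = LEAGUE * (1 - LEAGUE) / pa + x * x
    return np.mean((obs - LEAGUE) ** 2 / var)

lo, hi = 0.0, 0.02
for _ in range(60):   # bisect for the x that makes the average squared deviation equal one
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if mean_sq_sigma(mid) > 1.0 else (lo, mid)
x = 0.5 * (lo + hi)

print(x)                                 # comes back near the true 0.0022
print(LEAGUE * (1 - LEAGUE) / x ** 2)    # implied regression constant (about 2100 PA when x = 0.0022)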


MGL takes on the Neyer challenge (January 13, 2004)

Discussion Thread

Posted 3:36 p.m., January 15, 2004 (#21) - AED
  It's a tricky problem. Regressing opponents' rankings will wash out real disparities in schedules. In other words, if my opponents have a true ability of 0.450 and indeed go 0.450 (the average outcome), regression to the mean would give me credit for having played a tougher schedule than I actually did. On the other hand, not regressing opponents' rankings leaves you more susceptible to errors.

The optimal approach is to get the most accurate preseason projection possible and regress each team to its projection. Of course doing so will bias your system severely, so it would be unacceptable for the usual computer ranking purposes.

What I do is a two-step process, in which I first compute all team strengths with no regression and then compute each team's ranking using its prior and the opponent strengths.


MGL takes on the Neyer challenge (January 13, 2004)

Discussion Thread

Posted 8:01 p.m., January 17, 2004 (#36) - AED
  Dackle, that's not correct -- it does not have to be a self-consistent system. If you want to know a player's opponent-adjusted contributions, you should do so using the best estimate of opponent strengths (which involves a prior) but the player's actual contributions. Were MGL attempting to compute players' opponent-adjusted skills, he would indeed want a self-consistent system where everything was regressed.

Ross, use of a prior is not an attempt to manipulate the data; it is in fact demanded by Bayes' theorem. Failing to use a prior is a violation of statistical principles, and is only justified if the prior is too difficult to measure accurately.


MGL takes on the Neyer challenge (January 13, 2004)

Discussion Thread

Posted 3:43 p.m., January 18, 2004 (#42) - AED
  Dackle, you've basically reinvented several existing ranking systems, albeit without the necessary prior. The reason a prior is needed can be illustrated easily. Suppose that the Royals were 10-0 instead of 9-1, and that their average opponent's schedule-adjusted record was 0.550. Following your logic, this means that the Royals' schedule-adjusted record was 1.050 -- 10.5 wins in 10 games.

Finally, it seems misleading to say "The Royals' 9-1 record was helped 0.8 wins by the schedule," when the 0.8 is calculated using regressed values.
It isn't the least bit misleading. After 162 games have been played, the number of wins contributed by opponents in the Royals' first ten games will probably be a lot closer to 0.8 than to 2.8. Using 2.8 would therefore be misleading, because it's probably very far from the truth.


MGL takes on the Neyer challenge (January 13, 2004)

Discussion Thread

Posted 6:54 p.m., January 18, 2004 (#46) - AED
  Tango, how do odds ratios work for unbeaten teams? Forget the 9-1 example, a 10-0 (or 0-10) team shows why a prior has to be used.


MGL takes on the Neyer challenge (January 13, 2004)

Discussion Thread

Posted 7:52 p.m., January 18, 2004 (#48) - AED
  So every unbeaten team has a schedule-adjusted record of 1.000, no matter how easy or difficult its schedule? In football terms, Carroll College of Montana (14-0) and St. John's of Minnesota (14-0) should be ranked ahead of LSU (13-1)? If you don't use priors, you're making a serious mistake.


MGL takes on the Neyer challenge (January 13, 2004)

Discussion Thread

Posted 11:20 p.m., January 18, 2004 (#50) - AED
  Dackle, the system you suggest is EXACTLY like Sagarin's Elo-based system, but without the ratings. And I can assure you that such a system would give all undefeated teams infinitely high rankings.


MGL takes on the Neyer challenge (January 13, 2004)

Discussion Thread

Posted 11:21 p.m., January 18, 2004 (#51) - AED
  Sorry, typing faster than thinking...

"Dackle, the system you suggest is EXACTLY like Sagarin's Elo-based system, but without the *priors*. And I can assure you that such a system would give all undefeated teams infinitely high rankings."


MLB Timeline - Best players by position (January 14, 2004)

Discussion Thread

Posted 3:41 p.m., January 23, 2004 (#21) - AED
  I indeed made an examination of clutch hitting tendencies from Retrosheet data and found a high statistical significance (~99%) that individual batters hit differently in clutch and non-clutch situations. About half the variance in player clutch performance can be explained by a correlation between clutch performance and slugging average; the rest is mostly 'clutch ability'. I wrote up the details for a different site, which wasn't interested on the grounds that there was nothing new.

The sim game isn't as detailed as Diamond Mind, but is accurate enough that one can apply EqA, ERC, DIPS ERA, and similar calculations. (I've never mentioned it around here because it's a for-pay site and I don't want to sound like I'm trying to get free advertising.)


MLB Timeline - Best players by position (January 14, 2004)

Discussion Thread

Posted 4:40 p.m., January 23, 2004 (#24) - AED
  Clutch differences are responsible for a standard deviation of approximately 7.5 points of OBP in clutch situations for the players in my sample. The inherent distribution of abilities of the same set of players in overall OBP is about 22 points. So it's definitely not a huge factor -- you don't turn 0.250 hitters into 0.400 hitters -- but is large enough to be measured. I don't recall the average leverage, but one S.D. change in one player should translate into about 1/4 or 1/5 of a win over 162 games.


MLB Timeline - Best players by position (January 14, 2004)

Discussion Thread

Posted 5:51 p.m., January 23, 2004 (#26) - AED
  Yes, the difference between a 2 S.D. "clutch player" and a 2 S.D. "choker" is about one win per season. The problem is in coming up with a way to know who is clutch and who isn't. Aside from the correlation with slugging I didn't find any simple way of measuring it. (Due to random noise, one needs about 2200 PAs in clutch situations to measure a player's clutch OBP to an accuracy of 10 points.)
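The 2200 figure is straight binomial statistics; as a quick check, assuming a clutch OBA in the neighborhood of 0.33:

oba = 0.33          # assumed typical clutch on-base average
target_sd = 0.010   # desired accuracy: 10 points of OBA

# binomial: sd = sqrt(p*(1-p)/N), so N = p*(1-p)/sd^2
print(round(oba * (1 - oba) / target_sd ** 2))   # about 2200 plate appearances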

I'll try to find my old writeup on this and post it.


SuperLWTS Aging Curve (January 26, 2004)

Discussion Thread

Posted 2:53 p.m., January 27, 2004 (#12) - AED
  MGL, I think you're overstating the sampling issue here. The fact that a player's future number of opportunities can be affected by luck in his earlier opportunities does not negate the fact that the opportunities themselves were unbiased measurements of ability.

Also, as long as I'm nitpicking, the statistical uncertainty in the difference of lwts/680 between two seasons is something like sqrt(1/N1+1/N2). So in terms of weighting the values for different players, the weight should equal 1/(1/N1+1/N2), not min(N1,N2).
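A quick illustration of the difference between the two weightings (the season sizes are hypothetical):

def weight(n1, n2):
    # inverse of the variance of the difference between two seasons' rates,
    # up to the common factor that divides out
    return 1.0 / (1.0 / n1 + 1.0 / n2)

print(weight(600, 600))   # 300.0
print(weight(600, 100))   # about 86, versus min(N1, N2) = 100
print(weight(600, 50))    # about 46, versus min(N1, N2) = 50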

Tango, no need to regress anything here, unless you want to make a prior assumption of the shape of the aging curve (which is dangerous since it's what you're trying to measure).


SuperLWTS Aging Curve (January 26, 2004)

Discussion Thread

Posted 3:12 a.m., January 28, 2004 (#16) - AED
  MGL, I guess I was referring more to Tango's argument from his old thread than to what you did. You're right about selective sampling, but it primarily affects a player's first year in the majors, not his last. The easiest ways to get rid of that are to either require a player to have played at least N seasons in the majors (to get rid of guys who had one lucky year and then hung on for a little longer) or to eliminate players' rookie seasons from consideration.

Players who are near replacement level from 26-29 will show a spuriously shallow decline because of selection effects -- the only such players still on a major league roster at 34 are those who peaked late or declined slowly and stayed near replacement level.


Clutch Hitting: Fact or Fiction? (February 2, 2004)

Discussion Thread

Posted 12:22 a.m., February 3, 2004 (#4) - AED
  Analysis of batter types vs. pitcher types would be a full-blown study in its own right. A very interesting one, but not really the point I was trying to address (and not something I have time to do at the moment). However, I think one can get at the more obvious possible manifestations of that by looking at the correlations - are OBAs of sluggers low in clutch situations because of fewer walks, more strikeouts, fewer home runs, or fewer hits on balls in play? The answer is a combination of all of the above, but primarily fewer walks and fewer hits on balls in play. While closers give up fewer walks and sluggers tend to walk more, this trend accounts for only about 20% of sluggers' loss of walks in clutch situations. And the lower batting average on balls in play is an order of magnitude larger than would be explained by pitching style. So it indeed appears to be the hitter, not the pitcher.

Ryan, a typical Gwynn or Ichiro type of player will generally do about 15 points of OBA better in the clutch than would a typical Thome type of player. I haven't looked into the RISP list in the same detail I did clutch, so don't feel comfortable commenting on that.

Charles, I'm intrigued by your comment that this is "nowhere near as important as the author says it is." Can you point me to any study that found a statistically-significant difference in player performance in clutch and non-clutch situations?

Michael, I don't see much of a correlation or anticorrelation between change in OBA and change in HR/H or TB/H. I'll look into that in more detail when I get a chance.


Clutch Hitting: Fact or Fiction? (February 2, 2004)

Discussion Thread

Posted 1:12 a.m., February 3, 2004 (#5) - AED
  One follow-up on the Gwynn/Thome comparison. The typical singles hitter like Gwynn would tend to hit about 10 points of OBA worse in the clutch than would the average major leaguer. Gwynn himself was a very good clutch hitter, of course. (The trends I'm discussing account for only 0.005 of the overall clutch OBA scatter; the other 0.005 of the scatter is for reasons uncorrelated to batter profile.)


Clutch Hitting: Fact or Fiction? (February 2, 2004)

Discussion Thread

Posted 12:53 p.m., February 3, 2004 (#17) - AED
  Thanks for all the interest and feedback. Going through the list quickly:

MGL, a significant chunk of the variation in clutch performance is due to correlations between clutch performance and the players' batting profile. This part can be calculated rather accurately for anyone based on their overall stats. Slugging average is a catch-all that hides a lot of factors in about the same proportion as they influence clutch performance, so that's what I chose to use. I could have given clutch performance as a function of BB/PA, SO/PA, HR/(PA-BB-SO), and (H-HR)/(PA-BB-SO-HR), but it wouldn't have gained a whole lot in accuracy. (My note about Gwynn/Thome types did use such a breakdown; in retrospect I would have been nearly as accurate using just their career slugging numbers.)

Charles, the standard deviation is indeed small, but since these are high-leverage situations, the hits/wins ratio is higher. I have "win advancements" computed for the same years, so if I get a chance I'll try to put the two studies together.

Erik, in the initial phase of the analysis (the strict binomial test), I used multiplicative adjustments to the on-base averages to account for pitcher difficulty. The league average in 'clutch' situations was 0.322; that in 'early' (innings 1-5) was 0.331. So all clutch rates were multiplied by 1.02 and non-clutch rates were multiplied by 0.99. This should accurately reverse the log5 effect. (To make sure this adjustment wasn't causing anything bad, I also made a second run by adding/subtracting to the rates.) The follow-up you mention isn't needed -- given 150 clutch plate appearances in consecutive seasons and the measured spread of 0.007 in clutch skill, one knows the correlation to be 0.04.
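The 0.04 correlation is just the ratio of skill variance to total variance; a rough sketch, using the league clutch OBA above for the noise term:

pa = 150            # clutch plate appearances per season
oba = 0.322         # league clutch OBA (for the binomial noise)
skill_sd = 0.007    # measured spread of clutch skill

noise_var = oba * (1 - oba) / pa   # random variance in one season's clutch OBA
skill_var = skill_sd ** 2
print(round(skill_var / (skill_var + noise_var), 3))   # roughly 0.03-0.04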

Walt, naturally the game situation will affect what strategies are used. But since pitchers are always trying to prevent runs and batters are always trying to score runs, it isn't clear that there should be a huge difference between how they approach trying to score/prevent runs in the 7th inning rather than in the 2nd inning. I'm not completely sure what your test did, but I did simulate a very large number of seasons with both the null hypothesis (equal performance in clutch/non-clutch) and the clutch hitting model. Only 0.4% of null hypothesis models were more deviant than the actual data.

PhillyBooster, I believe the 0.009 drop in overall OBA in clutch situations is due to the fact that opposing teams tend to send out better pitchers in those situations. I would have to check that more carefully to give a definitive answer.

Charles, the 28% figure is the ratio of the total rms in clutch ability to the rms in inherent OBA skill. The small correlations I have detailed are only part of that total -- an equal part of it appears to be uncorrelated with anything obvious. Batting average is affected as well, about half as strongly as OBA and SLG; this is due almost entirely to batting average on balls in play.


Clutch Hitting: Fact or Fiction? (February 2, 2004)

Discussion Thread

Posted 2:40 p.m., February 3, 2004 (#22) - AED
  Here are numbers from the data I used, which may be more directly useful in addressing questions about this article.

The average pitcher in a "clutch situation" had the following career stats:
0.154 SO/PA, 0.081 BB/PA, 0.021 HR/PA, 0.286 BABIP

The average pitcher in a "non-clutch situation" had the following career stats:
0.145 SO/PA, 0.082 BB/PA, 0.021 HR/PA, 0.285 BABIP

So aside from a 6% higher strikeout rate, there's nothing significantly different about the pitching faced.


Clutch Hitting: Fact or Fiction? (February 2, 2004)

Discussion Thread

Posted 4:59 p.m., February 3, 2004 (#26) - AED
  Tango, that's correct about the averages. Regressing the pitchers' career numbers to estimate true talent, I estimate that the OBA allowed by an average pitcher in a clutch situation is 0.316, while that allowed by an average pitcher in a non-clutch situation is 0.313. So, again, aside from the strikeout rates (which are still 0.154 and 0.145 after regressing), the overall quality of pitching faced in the two situations is essentially the same. Pitchers in clutch situations allowed 37.02% of batters not struck out to reach base; pitchers in non-clutch situations likewise allowed 37.02%.

Ross, absolutely, the choice of "clutch situation" matters. As noted, I used a rather broad definition to include any situation that might have pressure, which allowed a more precise statistical test between the two samples. My main goal was to find out whether or not there is any difference whatsoever; with that determined, follow-up studies are warranted.

If I define "clutch" as 8th inning or later, trailing, and with the tying run on base, at the plate, or on deck, and lower the minimum number of clutch plate appearances to 100, I find the rms spread in clutch skill to be 0.0135 in OBA, which is statistically significantly higher than the spread in clutch skill in my original definition and is over half the distribution of overall OBA. (As before the quality of pitching is the same, except that pitchers in clutch situations are now 10% better at getting strikeouts than those in non-clutch situations.) So yes, it would seem that whatever differences there are get larger in situations with more pressure, which means that I underestimated the impact of clutch hitting on wins. Tango's upcoming study uses lwtsOBA and LI, so he can more directly calculate win impact than I can.

Because the clutch situation definition includes situations when the tying runner is on deck or on base, 'clutch' does not necessarily mean "home run to win or tie". So there are some situations in which the batter feels pressure to hit a home run, others where he feels pressure to drive a runner home from third, and yet others in which he just needs to get on base safely.


Clutch Hitting: Fact or Fiction? (February 2, 2004)

Discussion Thread

Posted 5:12 p.m., February 3, 2004 (#28) - AED
  the OBA allowed by an average pitcher in a clutch situation is 0.316, while that allowed by an average pitcher in a non-clutch situation is 0.313.

I realize this can be interpreted two ways, plus I got the numbers reversed for good measure. Having calculated the "OBA talent" for each pitcher using his career stats regressed appropriately towards the mean, the average "pitcher OBA talent" faced in a clutch at-bat is 0.313 and the average "pitcher OBA talent" faced in a non-clutch at-bat is 0.316.


Clutch Hitting: Fact or Fiction? (February 2, 2004)

Discussion Thread

Posted 6:13 p.m., February 3, 2004 (#32) - AED
  Charles, both of our studies carefully compared the actual distributions with random distributions. Mine was more significant in that regard since I used plain OBA (and thus no need to fudge the variance), but Tango also found more variance than would come from randomness for any reasonable value of his fudge factor. So yes, "clutch ability" does exist, in the sense that players have different talent levels in clutch and non-clutch situations.

RISP was one of my samples, and I found a significant deviation between RISP and non-RISP performance there as well.


Clutch Hitting: Fact or Fiction? (February 2, 2004)

Discussion Thread

Posted 9:26 p.m., February 3, 2004 (#36) - AED
  David, you seem to have it reversed. Most of the 0.0135 spread in clutch OBA seen in 8th inning and later clutch situations is not correlated with batter profile. (The rms scatter from batter profile is about 0.005; that from other sources is thus 0.0125.) Likewise, the pitching seen in those clutch situations is not very different from that seen in non-clutch situations.

That some of these situations seem to coincide with a fan's idea of greater "pressure" being felt, or with a "Leverage Index", is probably a coincidence.

I had three samples of data (3rd-5th inning, 1 out, and winning by 1 or 2 runs), and in none of these was any significant difference between in-sample and out-of-sample performance found. I also have four samples of data (my original clutch definition, the GABSB clutch definition, RISP, and my 8th inning and later clutch definition), all of which have significant differences. So it is extremely unlikely to be a coincidence. Looking at the amount of the difference among these samples, it also appears that the change in true performance increases with LI.

There is still no real evidence that "clutch" ability exists.

There is evidence that players perform differently in clutch situations in a way that is independent of the player's batting profile or the pitching faced. If it's not "batter profile" and not "opposing pitching", "steel balls" is a probable explanation.


Clutch Hitting: Fact or Fiction? (February 2, 2004)

Discussion Thread

Posted 1:40 p.m., February 4, 2004 (#45) - AED
  You cannot look at clutch situations and consider OBP only, and get a true result.

Quite the opposite - by using a binomial statistic such as OBP, you know the expected random variance well enough to establish that the differences between the samples are non-random. Because of the ambiguity in the treatment of variance in lwtsOBA, Tango's results do not establish the presence of clutch differences; however, given my finding that they exist, his study is a nice follow-up.

For what it's worth, Jack Clark's OPS and lwtsOBA were both also lower in clutch situations.


Clutch Hitting: Fact or Fiction? (February 2, 2004)

Discussion Thread

Posted 5:09 p.m., February 4, 2004 (#51) - AED
  Tango, how much of an anticorrelation is there between performance in high- and low-leverage situations?


Clutch Hitting: Fact or Fiction? (February 2, 2004)

Discussion Thread

Posted 7:25 p.m., February 4, 2004 (#55) - AED
  Tango, I should have been clearer. Suppose that you define "neutral" situations as leverage groups 1/2/3. You can then calculate players' "clutchness" by comparing neutral to group 4, and players' "loss of interest" by comparing neutral to group 0. Having done that, is there a correlation (or anticorrelation) between the two?

Steve, it's not that either. I defined yet another "clutch situation" definition that ignores base/out state: 7th inning or later, batting team behind by 1, 2, or 3 runs. The discrepancy between clutch and non-clutch situation batting is still there.


Clutch Hitting: Fact or Fiction? (February 2, 2004)

Discussion Thread

Posted 12:26 a.m., February 5, 2004 (#58) - AED
  David, I don't have the data in the right form to search for players with large platoon splits, but I can eliminate platoon effects by selecting only plate appearances against RH pitchers. Overall, I have a hard time seeing that this makes any difference -- with one clutch definition, such as the GABSB definition, the complete clutch sample gives a marginally larger spread; with another, such as my 8th/later definition, the RHP-only sample gives a marginally larger spread.

I agree about the secondary effects regarding RISP; that is why I didn't make a big deal of it (even though the statistical significance was greater).

Tango, I indeed did the reverse by creating a clutch definition that considered only inning and score (not baserunners), as noted in #55.


Clutch Hitting: Fact or Fiction? (February 2, 2004)

Discussion Thread

Posted 4:50 p.m., February 5, 2004 (#62) - AED
  Thanks, Tango. Just for reference, what are the average LI's of the five groups?


Clutch Hitting: Fact or Fiction? (February 2, 2004)

Discussion Thread

Posted 6:53 p.m., February 5, 2004 (#65) - AED
  Tango, there's a mistake in your calculations. The estimate of true talent has an uncertainty, not just the clutch measurement. So the "model SD" equals:
1.08*lwtsOBA*(1-lwtsOBA)*(1/Nclutch+1/(Nnonclutch+209))
You left out the 1/(Nnonclutch+209) factor. Correcting this and rerunning, I find very good agreement between our respective results for plain OBA in the clutch.

Making this correction, the standard deviation of clutch effect is around 0.01 clutch lwtsOBA, or about 0.2 wins per season.
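For concreteness, here is how the correction looks in a short sketch (I am treating the quoted expression as the model variance, with its square root as the SD, and the player's PA counts here are made up):

from math import sqrt

def model_sd(lwts_oba, n_clutch, n_nonclutch):
    # scatter expected in the clutch-vs-talent comparison, including the uncertainty of the talent estimate
    var = 1.08 * lwts_oba * (1 - lwts_oba) * (1.0 / n_clutch + 1.0 / (n_nonclutch + 209))
    return sqrt(var)

def model_sd_no_talent_error(lwts_oba, n_clutch):
    # the version that ignores the uncertainty of the regressed talent estimate
    return sqrt(1.08 * lwts_oba * (1 - lwts_oba) / n_clutch)

print(model_sd(0.340, 150, 450))               # about 0.045
print(model_sd_no_talent_error(0.340, 150))    # about 0.040, so the expected scatter is understated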


Clutch Hitting: Fact or Fiction? (February 2, 2004)

Discussion Thread

Posted 11:45 p.m., February 5, 2004 (#69) - AED
  Tango, I'd be surprised if you included the uncertainty of the talent estimate. I was able to reproduce your results pretty well without it.

Cyril, that's not the best way to look at it. The probability of getting 6 or more 2-sigma deviations out of 60 points, purely from randomness, is 8%. So by trying to count number of X-sigma deviations, you'll never measure anything subtle. What I've done is measure the cumulative probability of all data being created from randomness alone, and what clutch hitting distribution was needed to make it consistent.

Strictly speaking, chi^2 (the better-known version of z-scores) is based on Gaussian, not Poisson/binomial statistics. So you won't get as accurate an answer as you would using the binomial probabilities, but as long as the sample sizes are large, the difference isn't that big.


Clutch Hitting: Fact or Fiction? (February 2, 2004)

Discussion Thread

Posted 12:09 p.m., February 6, 2004 (#71) - AED
  The equation in the appendix gives the cumulative probability. The Monte Carlo test gives a large number of "randomness alone" samples for comparison.


Clutch Hitting: Fact or Fiction? (February 2, 2004)

Discussion Thread

Posted 1:51 p.m., February 6, 2004 (#73) - AED
  Ross, there is an inherent spread in BABIP among pitchers that can be measured; I don't think there's much question of that. The greater question is how much is due to ballpark effects, quality of fielding, pitcher's skill, and luck. Go to Tango's clutch hitting study and click the "solving DIPS" link for a discussion of the topic.


Clutch Hitting: Fact or Fiction? (February 2, 2004)

Discussion Thread

Posted 8:03 p.m., February 6, 2004 (#78) - AED
  Cyril, I'm not sure what you're getting at with the correlations of close-and-late OPS and non-close-and-late OPS with winning. It looks like you're trying to measure LI the hard way. The combined importance of the 15% of PAs that meet your "close and late" definition is 43% that of the other 85% of PAs. Since that ratio would be 18% if LI were constant, this means that the average LI of "close and late" PAs is 2.5 times that of "not close and late" PAs, or mean LIs of 2.0 and 0.8, respectively.
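The arithmetic behind those two mean LI figures, as a quick sketch:

share_cl = 0.15           # fraction of PAs that are "close and late"
importance_ratio = 0.43   # their combined importance relative to the other 85%

flat_ratio = share_cl / (1 - share_cl)       # about 0.18 if LI were constant
li_ratio = importance_ratio / flat_ratio     # about 2.4-2.5

# the overall mean LI is 1, which pins down the two group means
li_other = 1.0 / (share_cl * li_ratio + (1 - share_cl))
print(round(li_ratio * li_other, 2), round(li_other, 2))   # roughly 2.0 and 0.8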

I think we agree that clutch hitting accounts for at most a few hits per year. The issue is leverage. If a player's clutch skill moves 2 hits from "non-clutch" to "clutch" situations (keeping overall batting average constant), this adds about 0.2 wins for the average LI in your definition, more under Tango's definition (which is selected by LI).


Clutch Hitting: Fact or Fiction? (February 2, 2004)

Discussion Thread

Posted 12:20 a.m., February 7, 2004 (#83) - AED
  It's pretty clear that the relationship between hits and runs is not linear

For the team, yes, since aside from a HR you need more than one "positive event" to score a run. Likewise for pitching. For hitters, though, it is basically linear. The result of one at-bat only affects that player's next at-bat if 10 or more batters come up in the same inning. Otherwise there's lots that affects the importance of a batter's at-bats, but the batter himself isn't responsible for it.

Cyril: I used the coefficients you measured: (0.345+0.421)/(0.918+0.845). I assumed the difference between your offense and defense coefficients was probably just from random noise. (Otherwise, it would seem that giving up a run when ahead 4-3 in the bottom of the ninth hurts you more than scoring a run when down 4-3 in the bottom of the ninth helps your opponent.)


Clutch Hitting: Fact or Fiction? (February 2, 2004)

Discussion Thread

Posted 4:42 p.m., February 8, 2004 (#87) - AED
  A good hitting team gains more than a bad team from the same offensive action.

Sure, a batter hitting behind a great OBP hitter means that he will have a disproportionate number of at-bats with a man on base, and thus his at-bats have greater leverage. However, since he is not himself responsible for that increased leverage, his impact on the game outcome is linear and should be treated as such.

The easiest way to look at it is that each player goes to the plate with a certain probability of his team winning, and finishes his plate appearance with a different probability of a win. The difference between the two is very closely related to Tango's lwtsOBA for the outcome of the at-bat multiplied by the LI. The lwtsOBA is the player's responsibility; the LI is not.


Clutch Hitting: Fact or Fiction? (February 2, 2004)

Discussion Thread

Posted 11:59 p.m., February 8, 2004 (#89) - AED
  Ross, I think we're talking about different things. You initially raised the point that run creation is not linear, which I interpreted as meaning how runs are scored within one inning. While that is true, each player's contribution is linear -- enter with X% chance of winning, leave with Y% chance of winning, and Y-X is closely related to the lwts of the outcome times the leverage index.

The net offense is zero, pretty much by definition. There will obviously be rounding issues; I think the study you are thinking of was in error by something like 0.0002 wins/PA -- it adds up over a lot of seasons, but the error itself is negligible.


Clutch Hitting: Fact or Fiction? (February 2, 2004)

Discussion Thread

Posted 2:26 a.m., February 9, 2004 (#92) - AED
  Ross, the median change is negative, in that over 50% of at-bats result in outs. However, the average change is zero because the positive changes tend to be larger than the negative changes. The only way to get net offense to be nonzero for the entire league is to estimate the run-scoring environment correctly. If your metrics are set up for 4.80 runs per 9 innings and the league average is 4.90, you'll get a positive net offense and equally negative net defense.

I don't quite follow your second point. By using the LWTS run value for an event, multiplying it by the LI, and dividing by the average runs-per-win scaling ratio, you do account for different performance in different situations. In other words, if a strikeout is worth -0.027 wins on average but the LI is 2.0, then a good first-order estimate is that the strikeout cost your team 0.054 wins. If you happen to strike out when LI is 2.0 and happen to hit singles when LI is 0.5 (i.e. you're choking), a leverage-weighted lwts estimate will show you as choking. So there are some finer details that get swept under the rug with this approximation, but since detecting any clutch skill is on the hairy edge of our statistical abilities, I think this is good enough.

Note that it is still linear. The total amount of harm done to my team if I strike out in two games with the bases loaded in the 9th and down a run is exactly double the harm done had I only done that once. So the total effect is indeed the sum of the individual effects.


Clutch Hitting: Fact or Fiction? (February 2, 2004)

Discussion Thread

Posted 1:33 p.m., February 9, 2004 (#101) - AED
  Ross, if the mean is negative, it's because of an error in the computation process. As I noted on the thread on this topic, the error amounts to 0.0002 wins/PA, so it's not like it's a huge problem in his methods. However, the true total offense is indeed exactly zero, as is the mean outcome of a plate appearance.


Clutch Hitting: Fact or Fiction? (February 2, 2004)

Discussion Thread

Posted 2:23 p.m., February 9, 2004 (#104) - AED
  Tango, to answer your question from #97...

The values you are measuring are (variance)/(expected variance). The expectation value for this is 1.0; the random variance in this is 2.0.

For multiple measurements, the random variance drops as 1/N while the expectation value remains 1.0. So for 340 measurements, you would expect the average to be 1.0 with a random variance of 1/170 or a random standard deviation of sqrt(1/170)=0.077.

If you measured a value of 1.12, there is a 6% chance that a value of 1.12 or higher could have arisen purely from chance.
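The numbers work out as follows (a small sketch using the normal approximation):

from math import sqrt, erfc

n = 340
sd = sqrt(2.0 / n)            # random SD of the average variance ratio
z = (1.12 - 1.0) / sd
print(round(sd, 3))                          # about 0.077
print(round(0.5 * erfc(z / sqrt(2.0)), 2))   # one-tailed chance of 1.12 or higher: about 0.06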

However, I mentioned I was unable to reproduce your findings when I included the uncertainty of the regressed averages. When including this factor, I measure a variance ratio of 1.04, which is well within the noise.

As for your other question, park effects shouldn't really matter, since players are in the same park for low-LI and high-LI at-bats.


Clutch Hitting: Fact or Fiction? (February 2, 2004)

Discussion Thread

Posted 3:08 p.m., February 9, 2004 (#107) - AED
  Ross, it's not that Tango is making an arbitrary assumption that the average probability change should be zero. It's that accurate probabilities will result in a situation where the average probability change should be zero. If the average change is not zero, it is the probabilities that are in error.

Suppose that I assign a 0.90 win probability for the home team in a game tied heading into the bottom of the 9th inning, for example. Obviously at the end of that inning, the home team's win probability will be either 1.00 (if they scored) or 0.50 (if they didn't), which means that the home team's offense gets credited either with +0.10 wins with a score or -0.40 wins if they don't score. Now suppose that only 30% of such teams actually score in this situation, meaning that the "average" wins produced by offenses in this situation is -0.25 wins. By your argument, I would conclude that offenses in this situation tend to produce -0.25 wins. However, that's not right; the reason for the -0.25 win average was that my 0.90 win probability was a mistake and should have been 0.65. Setting the win probability to equal the actual odds that a team in that situation will win the game, the net wins produced equals exactly zero.
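In code form, the toy example works out like this:

def net_offense(p_before, p_score, p_if_score=1.00, p_if_not=0.50):
    # average win-probability change credited to the offense in this state
    return p_score * (p_if_score - p_before) + (1 - p_score) * (p_if_not - p_before)

print(net_offense(0.90, 0.30))   # about -0.25: the miscalibrated prior makes the offense look bad
print(net_offense(0.65, 0.30))   # essentially 0.0: the calibrated prior nets out exactly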


Clutch Hitting: Fact or Fiction? (February 2, 2004)

Discussion Thread

Posted 10:01 p.m., February 9, 2004 (#121) - AED
  I've already addressed that study several times. In 4443803 PAs, the author finds a total offense contribution of -1067 wins and a total pitcher contribution of +1067 wins. This means that the discrepancy is 0.0002 wins per PA. So first off, if you really think you can measure contributions to an accuracy of 0.0002 wins, you're deluding yourself.

More to the point. If I estimate a 0.60 win probability from a specific situation, and the average win probability from the next situation is 0.70, I've screwed up. Put differently, if I calculated a 0.60 win probability for a specific situation but 70% of teams in that situation go on to win, then obviously the win probability in that situation was really 0.70! This isn't an arbitrary "theoretical construct" as Ross claims; it's the straightforward definition of the term "win probability".

Anyone who has done this sort of stuff knows that getting the transition probabilities perfect is ridiculously hard, and that small errors of this size are to be expected.


Clutch Hitting: Fact or Fiction? (February 2, 2004)

Discussion Thread

Posted 1:38 a.m., February 10, 2004 (#128) - AED
  But we can measure how often the offensive and defensive teams have won with a no one out, runners on first and third in the bottom of the third inning with the game tied. And we can measure how often they win with one out, and we can measure the difference between those two and we can assign that difference to the player who was at the plate and made the second out. And we can do that for every plate appearance. And when we are done doing that we can add up all those changes.

If you actually bothered to do the analysis you suggest, you would find that the offensive team gains exactly as much ground as it loses.

However, this is not what Oswalt has done; he's estimated the state-to-state transition probabilities and computed win probabilities for each state using those estimated transition probabilities. The estimations are pretty good, but not perfect. The errors in those assumptions are what cause the 1000 win discrepancy.


Clutch Hitting: Fact or Fiction? (February 2, 2004)

Discussion Thread

Posted 12:56 p.m., February 10, 2004 (#141) - AED
  No one can find an error in Oswalt's methodology or his data that would demonstrate why it is wrong.

This is false. In my previous post (#128), I did in fact carry out the work you suggested. That is why I know for a fact that, if you took empirical transition probabilities for base-out-inning-score states, you would indeed find a total offensive value of zero.


Clutch Hitting: Fact or Fiction? (February 2, 2004)

Discussion Thread

Posted 2:28 p.m., February 10, 2004 (#146) - AED
  Fine, Ross. Even though you didn't realize that I had done the work, you do now. And as I said, I do not duplicate Oswalt's 1000-win discrepancy; rather I find it to be spurious. If you choose not to believe me, that's your problem - not mine. I have better things to do than argue over something as self-evident as this...


The genius of Paul DePodesta (February 4, 2004)

Discussion Thread

Posted 9:52 p.m., February 10, 2004 (#30) - AED
  Mike, I'm not sure what you're talking about. Using aggregate data in the way that is described in that paper is completely different from using known distributions of talent as priors in Bayesian inference.


Clutch Hitting: Logisitic Regression (PDF) (February 10, 2004)

Discussion Thread

Posted 2:18 p.m., February 10, 2004 (#3) - AED
  ...the results of this analysis stand in strong contradiction to the results of Andrew Dolphin...

Well, hold on a minute... As I noted in post #104 of the clutch hitting thread, I found a chi^2 of 1.04 for Tango's data, with random S.D. of 0.077 in chi^2. This gives a 30% likelihood that these data (or less consistent data) could have been produced without a clutch factor. You find a 31% chance. Therefore you are confirming my analysis of Tango's data, not contradicting it.

In regards to my study, your statements are misleading. You did not determine that players do not perform differently in the clutch; rather, you determined that any clutch factor was sufficiently small that it could not be definitively detected in four years of data. This says absolutely nothing about whether or not it can be detected at the 2-sigma level using 24 years of data. Given that our techniques give the same results on Tango's data, if anything your calculations show that mine are right and thus that the results from my larger data sample (analyzed similarly) are probably also right.


Clutch Hitting: Logisitic Regression (PDF) (February 10, 2004)

Discussion Thread

Posted 5:08 p.m., February 10, 2004 (#8) - AED
  Alan, it is customary to provide upper limits for non-detections. In other words, how large would the 'clutch effect' have to be for you to detect it? I'd guess that you're only sensitive to clutch if the standard deviation of the clutch talent distribution is 0.015 or higher. Can you quantify this more precisely?

Actually I noted my disagreement several times (#65, #69, #104), but that thread seems to have gotten hijacked by win advancement minutiae, so I fully understand how things get missed...


Clutch Hitting: Logisitic Regression (PDF) (February 10, 2004)

Discussion Thread

Posted 2:42 a.m., February 11, 2004 (#10) - AED
  Well, I didn't claim a p<=0.0001 level; I claimed a p<=0.009 level. Requiring 0.0001 is equivalent to a 3.7 sigma detection...

Granted that you won't measure a single player's clutch skill to arbitrarily high precision, but you can regress it to estimate the true talent level. Since we're talking a few tenths of a win, it's not going to be a huge difference but isn't negligible either, thanks to the fact that those situations are high leverage.


Clutch Hitting: Logisitic Regression (PDF) (February 10, 2004)

Discussion Thread

Posted 5:06 p.m., February 11, 2004 (#17) - AED
  If I'm reading this correctly, breaking it into a multinomial model and testing the various factors independently would require you to search for "clutch" factors independently in all factors, right? If sample size is killing you now, won't it be worse trying to measure clutch changes in triples rates?

As the data are posted, I think the best you can do is estimate the standard deviation of lwtsOBA as:
0.883*sqrt(lwtsOBA/PA).
Although this approximation is imperfect on a player-by-player basis (a high-HR player will have a larger random variance than a low-HR player with the same lwtsOBA), it is correct for the overall league and thus the overall chi^2 model should work.
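A sketch of how that approximation would be used, with made-up clutch and non-clutch numbers for one player:

from math import sqrt

def lwtsoba_sd(lwtsoba, pa):
    # approximate random SD of lwtsOBA over pa plate appearances
    return 0.883 * sqrt(lwtsoba / pa)

clutch, n_clutch = 0.310, 160         # hypothetical player
nonclutch, n_nonclutch = 0.355, 440

sd_diff = sqrt(lwtsoba_sd(clutch, n_clutch) ** 2 + lwtsoba_sd(nonclutch, n_nonclutch) ** 2)
z = (clutch - nonclutch) / sd_diff    # summing z**2 over all players gives the chi^2
print(round(z, 2))                    # about -1, well within noise for any single player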


Clutch Hits - Tango's 11 points to think about --- to understand why we regress towards the mean (February 12, 2004)

Discussion Thread

Posted 4:38 p.m., February 12, 2004 (#9) - AED
  Every above-average player is overrated by their stats, and every below average player is underrated by their stats.

Such a statement isn't entirely inaccurate. It's easy to get confused here. A player whose "true talent" OBA is 0.400 is most likely to have an OBA of 0.400. However, because of sample biases (there are more mediocre players than great ones), a player who had an OBA of 0.400 probably has a "true talent" that is less than 0.400.

MGL, the x/(x+PA) equation is indeed the correct solution in the case that the Gaussian approximations are valid (or are reasonably close to valid).
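In practice the regression looks like this (a minimal sketch; the league mean and the regression constant here are placeholder values):

def regress(observed, pa, league_mean=0.335, x=210):
    # shrink the observed rate toward the league mean by the fraction x/(x+PA)
    return observed + x / (x + pa) * (league_mean - observed)

print(regress(0.400, 100))   # heavily regressed: about 0.356
print(regress(0.400, 600))   # lightly regressed: about 0.383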


EconPapers: Steven Levitt (February 24, 2004)

Discussion Thread

Posted 4:01 a.m., March 1, 2004 (#11) - AED
  Only if the line becomes sufficiently skewed that the odds of winning the bet are high enough to make it worthwhile. I don't recall the exact amount, but bookies take a "commission" of sorts, so that you have to bet something like $1.10 to have the chance of winning $1.00. Statistical models can beat the odds, even with the overhead costs, but it's only about a 5% profit margin on highly volatile transactions.


Batter's Box Team Previews (March 1, 2004)

Discussion Thread

Posted 11:10 p.m., March 1, 2004 (#4) - AED
  Yeah, once you get past OPS, it's not something the average person can do without a calculator. At which point, you should use something more significant. If you want a batting average scale, how about multiplying Tango's lwtsOBA by 0.78? Or the base equation of EqA could be scaled to give another comparable value, but lwtsOBA is better-correlated with run production than is EqA.


Silver: The Science of Forecasting (March 12, 2004)

Discussion Thread

Posted 2:31 p.m., March 17, 2004 (#47) - AED
  MGL, the bands have little predictive value. Most of the error in any decent projection will be caused by randomness in the upcoming seasons' stats, which is Gaussian.

For example, in the sample of players with 300+ PAs, the expected rms error from randomness alone is about 0.074 in OPS. This means that the projection errors were around 0.05 in OPS. And if you take any reasonable distribution of projection errors and convolve it with a Gaussian that is 50% larger, you'll end up with something that looks awfully Gaussian.

So if a player has a strongly non-Gaussian band, all it tells you is that Nate couldn't find very many comparable players.


Wolverton - Wavin' Wendells - Outs on Base (March 12, 2004)

Discussion Thread

Posted 2:40 p.m., March 17, 2004 (#19) - AED
  I've got a fairly detailed program to optimize lineups. Looking at a sample of about 5 teams last season, the mean improvement was 0.1 runs per game. Bullpen usage can typically be improved by 3 wins, though of course it depends on the manager's tendencies. Other tactics, like bunts, steals, etc. can profit a team another win or so. So I think Tango's estimate of about 5 wins for optimizing an average team is correct.


Wolverton - Wavin' Wendells - Outs on Base (March 12, 2004)

Discussion Thread

Posted 1:25 a.m., March 20, 2004 (#46) - AED
  MGL, the five teams I checked lineups for were the Yankees, Diamondbacks, Cardinals, Astros, and I'm not sure of the other. For the first three, I used their typical lineup; the Diamondbacks seemed to use a different batting order every single game so I did my best guess of their 'usual'.

The bullpen analysis was done for only a couple teams, since it took a lot more work. The Diamondbacks had 2-3 very good relievers, but gave almost all of their "tied game" work to someone else. Maybe other teams were less blatantly inefficient, in which case the number of wins gained would be less.


Wolverton - Wavin' Wendells - Outs on Base (March 12, 2004)

Discussion Thread

Posted 7:27 p.m., March 21, 2004 (#49) - AED
  wow, I was posting much too late... That last post (#48) was me, not MGL. sorry!


Wolverton - Wavin' Wendells - Outs on Base (March 12, 2004)

Discussion Thread

Posted 10:33 p.m., March 23, 2004 (#52) - AED
  MGL, I used those numbers because they were the easiest to find. Here are the 2003 PECOTAs:
Mantei: 4.14
Oropesa: 5.05
Valverde: none
Villareal: 4.77
(Valverde was listed in BP2003, but had been injured in 2002 so they didn't make a projection for him.)

But I wasn't thinking of the PECOTAs, as much as the fact that the Diamondbacks themselves have touted Mantei as their ace closer and Villareal and Valverde as top prospects, while they released Oropesa after 2002. So they sent the reliever with the worst PECOTA, the worst actual 2003 stats, and the least confidence into tie games.

I could be wrong, but I don't think this reliever pattern is all that uncommon. It seems that a fairly typical pattern is to use the top relievers only when winning or to get them innings.


Copyright notice

Comments on this page were made by person(s) with the same handle, in various comments areas, following Tangotiger © material, on Baseball Primer. All content on this page remain the sole copyright of the author of those comments.

If you are the author, and you wish to have these comments removed from this site, please send me an email (tangotiger@yahoo.com), along with (1) the URL of this page, and (2) a statement that you are in fact the author of all comments on this page, and I will promptly remove them.