Tango on Baseball Archives

© Tangotiger


DIPS year-to-year correlations, 1972-1992 (August 5, 2003)

Treating this as a first step (parks and team switches still need to be taken into account), here are the year-to-year correlations for all 1,687 pitchers with at least 500 PA in consecutive years (a cutoff which may itself introduce a selective-sampling issue), 1972-1992:


Event r
K 0.78
BB 0.66
1B 0.47
HR 0.34
XBH 0.26
.....
1bBIP 0.25
xbhBIP 0.21

Except for the BIP rates, all are on a per-PA basis. XBH = (2b+3b)/PA.
1bBIP=1b/(PA-HR-BB-K), xbhBIP=(2b+3b)/(PA-HR-BB-K)
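For concreteness, the rate definitions above can be sketched in code (the season counts below are invented for illustration):

```python
# Sketch of the rate definitions in the post; the sample counts are made up.
def rates(pa, k, bb, hr, b1, b2, b3):
    bip = pa - hr - bb - k                  # balls-in-play denominator
    return {
        "K": k / pa, "BB": bb / pa, "1B": b1 / pa, "HR": hr / pa,
        "XBH": (b2 + b3) / pa,              # (2B+3B)/PA
        "1bBIP": b1 / bip,                  # 1B/(PA-HR-BB-K)
        "xbhBIP": (b2 + b3) / bip,          # (2B+3B)/(PA-HR-BB-K)
    }

r = rates(pa=800, k=140, bb=60, hr=20, b1=120, b2=35, b3=5)
```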

Well, well, well. Unless I goofed somewhere, the pitcher's skill is MORE prevalent among his singles than his doubles!

Discuss.
--posted by TangoTiger at 03:30 PM EDT


Posted 3:36 p.m., August 5, 2003 (#1) - tangotiger
Lowering the bar to at least 250 PA in consecutive years, you get the same ordering of results (each r is about .05 lower than the ones above).

Event r
K 0.74
BB 0.61
1B 0.40
HR 0.30
XBH 0.22

1bBIP 0.18
xbhBIP 0.17

Posted 4:12 p.m., August 5, 2003 (#2) - tlbos
  May I be the first to say...weird?

Posted 4:15 p.m., August 5, 2003 (#3) - Jason
  I'm not sure I understand...So what does this say about DIPS?

Posted 4:19 p.m., August 5, 2003 (#4) - Dan Werr(e-mail)
  I'm sure this is an infinitesimal factor, but singles are the area where you'd expect pitcher fielding to make its mark.

Posted 4:20 p.m., August 5, 2003 (#5) - dsm
  Maybe defense is more important for doubles & triples, which would make other factors (like pitcher skill) more important for singles.

Posted 4:21 p.m., August 5, 2003 (#6) - tangotiger
  You know what's even more weird? The year-to-year correlation for single/BIP and (2b+3b)/BIP for that second class was .18 and .17, right? The year-to-year correlation for (1b+2b+3b)/BIP was .15.

I think, though I'm not sure, that this must imply some negative relationship between 1B and 2B+3B. This may be due to the GB/FB tendency of the pitcher (a FB pitcher allows more extra-base hits and outs than a GB pitcher).

(For the 500 class, those numbers are .25, .21, .20)

As for what it says about DIPS, there's no change. The year-to-year r is .20, as has been reported by many people many times with very different data sets. It's still our best guess that if a pitcher has a (1b+2b+3b)/BIP rate of .320 and the league is .300, then the pitcher's "true" talent, based on the BIP, is about .304 (80% regression towards the mean, or 1-r). This applies to pitchers with 500 to 1200 PAs.
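The regression-to-the-mean arithmetic in that last paragraph, as a quick sketch (the .320/.300/.20 figures are the post's example):

```python
# Regress the observed rate (1 - r) of the way back toward the league mean;
# equivalently, keep only r of the deviation from the mean.
def regressed_estimate(observed, league, r):
    return league + r * (observed - league)

est = regressed_estimate(observed=0.320, league=0.300, r=0.20)  # -> 0.304
```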

Posted 4:21 p.m., August 5, 2003 (#7) - tlbos
  First thought: are there different kinds of singles? Do pitchers exert more control over the single that flies over the infield and lands in front of the outfielders than, say, the grounder between the 3B and SS? Is there perhaps a certain kind of single that's easily preventable by pitchers?

Is there (or could there be) a PbP data-based stat based on:
a) the fielding zone the ball is hit to
b) the type of hit (grounder, line drive, fly)
c) the chance of this type of hit in this zone to be fielded by an average fielder

At the end we would have some kind of adjusted hits on balls in play. That would be more or less "defense independent".

Tango, have you noticed any interesting individual pitchers in terms of the difference between 1b and XBH?

Posted 4:22 p.m., August 5, 2003 (#8) - tangotiger
  I think I do agree that fielding and park may play a much larger role on flyballs than ground balls, and therefore, we'll see the pitcher's influence relatively less.

Posted 4:23 p.m., August 5, 2003 (#9) - Andrew Edwards
  Hmmm...

Allow me that singles tend to be given up by ground ball pitchers, XBH by flyball pitchers. Does this suggest that GB pitchers have more control over BABIP? Or that infield defence is more stable than outfield defence?

Just tossing hypotheses.

Posted 4:27 p.m., August 5, 2003 (#10) - Andrew Edwards
  Post #9 was written before I read posts 5-8.

Stadium could account for some of this, but I suspect that there's also a degree to which outfielder defensive performance tends to decay more abruptly than infielder performance.

Alternatively, this could mean that outfielder defence has more influence than infielder defence, which is counterintuitive, at least to me.

Posted 4:30 p.m., August 5, 2003 (#11) - Ted Arrowsmith
  While this is not what I would have expected, doesn't it make sense that the changes in batters, defense, park, weather, and luck from year to year matter more for turning hits into doubles than for turning balls in play into outs? In particular, it seems possible that the defensive skill of corner outfielders makes a big difference in xbhBIP, and that corner outfield defense varies a lot year-to-year.

Good stuff Tango.

Posted 4:34 p.m., August 5, 2003 (#13) - tlbos
  Re: #5/#8 - I'm not sure I see that one. Is that an intuitive argument? Or does someone have data to indicate a better CF prevents more XBH (relatively) than a better SS prevents singles?

Posted 4:40 p.m., August 5, 2003 (#14) - Skem
  What about looking at doubles per PA alone? Just an idea; I'd imagine triples are flukier than doubles, and that may cause some of the difference. Though I guess triples are, more often than not, XBH hit by speedy guys or helped along by a fielder's misplay.

Could the PA denominator play a part in the lower r? As BBs/HBPs fluctuate, so does the percentage of chances that a hitter has to get a good pitch to smack for an XBH or even a single.

Just a WAG though.

Posted 4:41 p.m., August 5, 2003 (#15) - Srul Itza
  Has anyone ever done a study on roster "turn-over", as it relates to positions? Is there any difference at all between how likely a team is to have changes in its middle infielders or corner infielders or outfielders, from year to year? And whether this has changed over time.

I know that there have been studies regarding roster turnover in general, in response to the question of whether players changed teams more often pre-free agency as opposed to post-free agency (the general answer, IIRC, is not really; the difference is that the player now has some control over the turnover, whereas before the owners had almost all of the control). I was wondering if that was ever broken down by position in any way.

I do not know that this would have any effect on the findings above; it was triggered, though, by the comments regarding the differences between single rates (which would tend to be more affected by infield, especially middle infield, defense) and 2b/3b rates (which would tend to be more affected by outfield play).

Posted 4:58 p.m., August 5, 2003 (#16) - Erik Allen
  In regards to correlation coefficients for 1Bbip vs. xbhbip - I don't think you have to hypothesize something about the inner workings of the game in order to get the results indicated.

Correlation coefficients from one year to the next should depend on the relative number of occurrences of each type of event in a given year. Extra-base hits occur much less frequently than singles, so we would expect the relative variation in extra-base hits to be larger year over year, and hence the correlation coefficient to be smaller.

Just a thought...I may be wrong on my theory

Posted 5:02 p.m., August 5, 2003 (#17) - William Fontaine De La Teaur Dauterive
  Tango, could you please make the raw data available?

Posted 5:14 p.m., August 5, 2003 (#18) - Jim
  Great stuff Tango! As I mentioned in another thread, one might expect XBH to represent "harder" hit balls than singles. Now it's not so clear there is a difference in pitcher ability. Probably it's because a lot of singles are hard-hit balls and a lot of doubles and triples are swung under by the batter but happen to fall in the gap or down the line. Yet HRs are probably always partly the fault of the pitcher. There aren't really a lot of flukey HRs.

I think the groundball/flyball distinction is an important one, and it would be interesting to see this breakdown with respect to BABIP and related stats.

What research has been done on GB/FB pitchers? Do they correlate with other pitcher characteristics (power, control, sinkerballers, knucklers, age)? Any (even anecdotal) evidence that pitchers can reliably induce grounders or flies by using certain types of pitches or locations in the strike zone?

Posted 5:22 p.m., August 5, 2003 (#19) - tangotiger
  I really gotta go run, but here you go. I broke up the pitchers with at least 250 PA in both years into "GB" and "FB" pitchers.

I ran the correlation only for the xbhBIP category. The FB pitcher's year-to-year r was .10, while it was .19 for the GB pitchers. Seems to me that park and OF fielders play a big part here.

Ok... I'll do the same for 1bBIP: .13 for GB pitchers, and .15 for FB pitchers. Again, makes sense.

Sorry, but I don't have the breakdown by FBhits, GBhits, though that would be very useful.

Posted 5:59 p.m., August 5, 2003 (#20) - CFiJ
  tango,

That difference seems relatively small. Have you run any inferential tests to see if it is significant? A .04 difference may simply be random variation...

Posted 6:17 p.m., August 5, 2003 (#21) - studes (homepage)
  Okay, I know I'm showing my ignorance again. (I'm almost embarrassed to post sometimes!). But 1b and xbh on bip are basically even, right? At least, the coefficients are not different enough to draw significant conclusions. So the difference with the coefficients of the 1b rate vs. xbh rate must lie in those pitchers who have different rates of balls in play.

So, against those pitchers who allow more balls in play, those balls are more likely to be singles than extra base hits, right? This obviously would be a huge new insight, but isn't that essentially what the data says?

Posted 6:39 p.m., August 5, 2003 (#22) - studes (homepage)
  Nope. My bad. See what I mean? I misinterpreted the coefficients. So let me restate it: Among pitchers who allow higher rates of balls in play, the rate of singles is more predictable than any other kind of hit. Am I getting close?

Posted 6:40 p.m., August 5, 2003 (#23) - J Michael Neal(e-mail)
  Correct me if I'm wrong (I just took a final this afternoon and am kind of fried; it was a statistics class, oddly enough) but could the correlation difference mean not that outfield defense is more important, but that there is more variation in it? I think that there's some chance that teams are more willing to throw out three crappy outfielders than that they are willing to do the same with bad middle infielders, so the difference between the best and the worst outfield defenses could be greater than the difference between the best and worst infield defenses.

Posted 6:51 p.m., August 5, 2003 (#24) - deepsouth
  Just want to point out that post #16 by Erik Allen brings up something important ... the lower coefficient on XBH doesn't necessarily have any implications about baseball; it could easily be explained as the nature of the data.

Posted 7:11 p.m., August 5, 2003 (#25) - Jason Koral
  J Michael Neal's point about higher variation in quality of outfield defense finds some support in MGL's 3-year UZR data.

Posted 7:44 p.m., August 5, 2003 (#26) - FJM
  Let's change the R's to R^2's for the 250+ group so we can think in terms of % of variance explained.

1B/BIP: .18^2 = 3.2% explained.
XB/BIP: .17^2 = 2.9%.
XBH/PA: .22^2 = 4.8%.
1B/PA: .40^2 =16.0%.

Now the question is, which of these doesn't belong with the others? I think the answer is pretty clear.

It makes sense that XBH/PA would have a somewhat higher R^2 than XB/BIP since the denominator is bigger and hence more stable. (If K/PA, BB/PA and HR/PA were all perfectly correlated year-to-year it wouldn't make any difference. Since they aren't, it does.)

It also makes sense that the XBH/PA R^2 is only a little higher than the XB/BIP R^2, because the bulk of the variation comes from the numerator, not the denominator. But that means that the 1B/PA R^2 should also be only slightly higher than the 1B/BIP R^2, not 5 times greater. Something is wrong somewhere.

Posted 7:45 p.m., August 5, 2003 (#27) - Arvin Hsu
  I'll toss my hat in the ring and side with Erik.

The r of .40 for 1B vs. .22 for XBH should result solely from less relative variance in the number of singles, because you have more observations of singles per pitcher per year.

-Arvin

PS> In fact, if you control for # of observations decreasing the variance of the binomial, you might actually get a higher _actual_ degree of control for xbh.

Posted 8:39 p.m., August 5, 2003 (#28) - RossCW
  Just want to point out that post #16 Erik Allen brings up something important ... the lower coefficient on XBH doesn't necessarily have any implications about baseball, it could easily be explained as the nature of the data

As could the difference between the year-to-year correlation of BABIP and that of K/9 or HR or ... In fact, it's quite likely that the relative correlation of two different stats from year to year has no baseball meaning.

Posted 7:26 a.m., August 6, 2003 (#29) - Erik Allen
  Ross CW (#28)

  While I agree with you that, in a strict sense, comparing correlation coefficients of two statistics from year to year is technically meaningless, I think in certain cases it can be useful. In McCracken's original DIPS work, he shows that there is MUCH less predictability in BABIP than in K/9, BB/9, etc. So much less, in fact, that sample size issues are probably not the sole cause. This discovery in itself is quite interesting, because it explains in some sense why it is difficult to predict pitcher ERA from one year to the next.

I think the larger problem is assigning a cause to this type of study. McCracken attributes the discrepancy in correlation coefficients to a COMPLETE lack of pitcher "control." Subsequent writing on this site and others has shown this to probably be false.

Summing up (wait, I actually had a point? :)) : I think that comparing correlation coefficients is a pretty rough test to use, and owing to sample size effects, and the old aphorism that "correlation does not imply causation," it is really difficult both to show an effect exists and to attribute a reason to that effect.

Posted 8:49 a.m., August 6, 2003 (#30) - tangotiger
  When we look at the year-to-year correlation, we are not really trying to establish whether a pitcher has a skill, even though I and others talk as if we are.

What we are really saying is "does this particular metric correlate well year-to-year.... and if it does NOT correlate well year-to-year, then we should not be using it as a basis to predict the next year's metric".

So, if we replace the "ability" talk with the "metric's persistence", I think we'd be more accurate.

So, regardless of the extent to which a pitcher has a skill at preventing hits on balls in play, we are saying that:

[Official quote]

the metric "hits per ball in park" has an r of about .20 among pitchers with 500 to 1200 PA, and therefore we need to regress that metric heavily (80% for the group, which may not necessarily apply to the individuals to the same figure), if you want to predict next year's metric.

Even having next year's metric still does not tell you the pitcher's true underlying skill at preventing hits on balls in play. Just that, to the extent that we can measure this underlying skill, that's our best guess as to the expected outcome of that skill, with a [insert number] margin of error.

It may very well be that if we look at very specific breakdowns by zone, opponent, fielders, park, weather, etc., we CAN ascertain what a pitcher's skill is at preventing hits on balls in play (see: PZR). It's just that, for the moment, the metric called "hits per ball in play" does not do a good enough job of establishing the pitcher's skill at "hits per ball in play". (This is similar to how ERA, earned runs per 9 innings, does not do a good enough job of establishing a pitcher's skill at allowing earned runs per 9 innings.)

[End Official Quote]

************
I may be completely wrong, but the numerator is irrelevant to establishing the "strength" of the correlation. Triples/PA for a hitter, I believe, has an r over .50.

Think of it this way. Say I do: x = Triples/PA * 10 + .300, and then say newRate = x / PA. And I did a correlation year-to-year with either Triples/PA or x/PA.... I'm almost positive that my "r" will be identical.

It's the denominator that counts, not the numerator.

Posted 9:47 a.m., August 6, 2003 (#31) - Erik Allen
  Mr. Tiger (would this be the correct formal address?),

Let me just start by saying that I agree with everything you say up to the end of the official quote. You are absolutely right, IMO, to say that correlation coefficients can give us an indication of how predictive a given statistic will be for the next year. For many situations, this is all we really need, since we are simply trying to project next year's performance...my only objection was in trying to relate these correlation coefficients to physical realities of the game (i.e. attributing blame or credit to the hitter, pitcher, or fielder). I think that going down that road is very difficult to justify.

I am not sure if the last part of your message was directed towards my comments, and I am not sure if we are talking about the same thing (my fault probably...I am not the most eloquent writer). So, let me expand upon post 16:

  Imagine we have Joe Pitcher. Joe (or his fielders, or whomever) has a skill set such that 20% of balls in play fall in for singles, and 10% fall in for extra base hits. I am not sure about these numbers, but they seem to be in the right ballpark. For simplicity, let's treat these as independent, binomial variables. That is, we assume for each trial (a ball hit in play), there is a 20% chance that it falls in for a single, and an 80% chance that it does not. Similarly, for each trial, there is a 10% chance that the ball will fall in for an xbh, and a 90% chance that it does not. This is clearly a huge oversimplification, but it can suffice for now.

Over the course of the season, Joe Pitcher gives up 500 BIP. The expected value of singles should be
1B = n*p = 500*0.2 = 100.
Similarly, xbh = 50.
The standard deviation is given by
STD = sqrt(n*p*(1-p))
1BSTD = 8.94
xbhSTD = 6.71

Therefore, the relative standard deviation (STD divided by the expected value) is
1BRSD = 8.94 / 100 = 0.0894 = 8.94%
xbhRSD = 6.71 / 50 = 0.134 = 13.4%

The purpose of this analysis is to show that we would expect more year-to-year variability in xbhBIP simply because of sample size differences. So, say pitcher A is slightly better than pitcher B at preventing both 1Bbip and xbhbip, and by the same amount. We would expect, based on the above analysis, that we would more frequently OBSERVE pitcher B to be superior at preventing xbhbip than we would 1bbip. This could possibly explain the discrepancy you find, and we don't necessarily need to invoke any baseball reasoning to explain the data.

If you happen to locate a statistic that displays a HIGHER year-to-year correlation, even with a smaller numerator (i.e. hitter triple rates), then this would seem to imply that the differences in player ability outweigh the variability of the statistic.
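Erik's binomial arithmetic above checks out; as a quick sketch (500 BIP with the 20%/10% rates assumed in his post):

```python
import math

def binom_stats(n, p):
    """Expected count, standard deviation, and relative std for Binomial(n, p)."""
    mean = n * p
    std = math.sqrt(n * p * (1 - p))
    return mean, std, std / mean

m1, s1, rsd1 = binom_stats(500, 0.20)   # singles: 100 expected, std ~8.94 (~8.9%)
mx, sx, rsdx = binom_stats(500, 0.10)   # xbh:      50 expected, std ~6.71 (~13.4%)
```

The rarer event carries the larger relative noise, which is the whole argument: xbh counts wobble about 1.5 times as much, relative to their expected value, as singles do.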

Posted 10:47 a.m., August 6, 2003 (#32) - tangotiger
  Tom or Tango or Tangotiger is good.... I'm not old enough to be a mister.

Ok, I just ran the following test, and perhaps you can tell me what it means. I took 5 pitchers each with 1000 PA, and I randomly gave them, for each PA, a double, a single, or an out, at the rate of 0.1, 0.2, 0.7.

Therefore, we "know" what their true rates are. And we give them a full season to let their true rates manifest themselves.

Then, I did the same for year 2.

As an example, here are singles allowed, year-to-year, for the first 4 of the 20 pitchers in my group.

203,201

208,192

211,196

199,207

Now, since we know, absolutely know, that it's the same talent rate, then we should be able to explain the "r" based strictly on some statistical principle, probably standard deviation. [I'll let you insert that here.]

Anyway, for these 20 pitchers, here are their year-to-year r
2b: .18, 1b: .47, out: .11

Wouldn't we have expected the out, with the highest numerator, to have the highest r, based on your previous explanation?

Now, what I did for a second test was take the same 20 pitchers, but this time, change their talent rates in the second year. For example, allow a .10 doubles rate in the first year, and make it .08 in the second year for 1 pitcher, or .12 rate in the second year. In essence, I'm trying to change the talent rate of my pitcher year-to-year to try to get a lower "r".

Here were the results of that:
2b: .10, singles: .33, outs: .35

I'm not sure what this means, if anything. Perhaps having only 20 pitchers is really limiting, and maybe I should redo this with 50 or 100 pitchers.

I look forward to your comments...
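Tango's first experiment (identical true talents, two simulated seasons) can be reproduced along these lines; the seed and the per-PA draw mechanics are my assumptions:

```python
import random

random.seed(1)

def season(p2b=0.1, p1b=0.2, n_pa=1000):
    """One simulated season: counts of (doubles, singles, outs) over n_pa PAs."""
    d = s = 0
    for _ in range(n_pa):
        u = random.random()
        if u < p2b:
            d += 1
        elif u < p2b + p1b:
            s += 1
    return d, s, n_pa - d - s

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)
    return cov / var ** 0.5

# 20 pitchers, all with the same true rates (.1 / .2 / .7), two seasons each.
year1 = [season() for _ in range(20)]
year2 = [season() for _ in range(20)]
r_1b = pearson([t[1] for t in year1], [t[1] for t in year2])
```

With identical talents the expected r is zero; whatever nonzero r a 20-pitcher sample produces is pure sampling noise, which is the point Erik makes in #34.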

Posted 10:48 a.m., August 6, 2003 (#33) - tangotiger
  In my initial comment, "5 pitchers" should read "20 pitchers".

Posted 11:21 a.m., August 6, 2003 (#34) - Erik Allen
  Hmmm...that is very interesting stuff.

I can't say that my hypothesis is fully supported by your data, but at least some of it is predictable...

First, let's agree on some notation, to make this easier...
Going back to my college stat book (brush off dust...) I see that the correlation coefficient is defined as:

Corr = sum over i [(x_i-x_avg)*(y_i-y_avg)] / sqrt( sum over i [(x_i-x_avg)^2] * sum over i [(y_i-y_avg)^2] )

where x_i and y_i here would be the 1Bs allowed for two consecutive years. So, in your example, pitcher 1 gave up 203 and 201 singles in two consecutive years. So, x_1=203, y_1=201. x_avg and y_avg would be the true rates. x_avg=200, y_avg=100.

In your first simulation, all 20 pitchers should have the same ability. Therefore, if pitcherX were ABOVE average one year, we should not expect him to be ABOVE average the second year, and I would think that corr=0 for a sufficiently large sample. So either 1) 20 pitchers is too small a sample size or 2) I don't know what I am talking about. Give a 50/50 chance to both those possibilities. :)

The second case is closer to what I was imagining for a test...each pitcher has slightly different abilities. So, say I have 2 pitchers:
PITCHER A: 1B = 0.2, xbh = 0.1, out = 0.7
PITCHER B: 1B = 0.18, xbh = 0.09, out = 0.73
In both cases, PITCHER B is 10% better than pitcher A at preventing 1B and xbh. However, in the course of a given season, due to random variations, PITCHER A might show better "ability" than PITCHER B in both 1Bbip and xbhBIP. Further, I would expect A to beat B MORE OFTEN in xbhBIP due to the effects I described above. Since xbhBIP will be less indicative of their true talent levels, and more indicative of luck, we would expect the year-to-year correlation in xbhBIP to be lower than that of 1bBIP. We see that in your second test, where the tests go in order. However, as you say, 20 pitchers may not be enough to establish a trend.

Posted 11:22 a.m., August 6, 2003 (#35) - Erik Allen
  One error in my above post... y_avg = 200 also

Posted 12:35 p.m., August 6, 2003 (#36) - RossCW
  Let me just start by saying that I agree with everything you say up to the end of the official quote. You are absolutely right, IMO, to say that correlation coefficients can give us an indication of how predictive a given statistic will be for the next year. For many situations, this is all we really need, since we are simply trying to project next year's performance...my only objection was in trying to relate these correlation coefficients to physical realities of the game (i.e. attributing blame or credit to the hitter, pitcher, or fielder). I think that going down that road is very difficult to justify.

This is a much clearer statement of what I was trying to say.

Posted 1:31 p.m., August 6, 2003 (#37) - Erik Allen
  Tango,

I was very intrigued with your results above, so I repeated the simulations that you performed, to see if our results matched.

First, I ran the case where each pitcher has the same ability: 1b=0.2,xbh=0.1,out=0.7. I used 1000 balls in play as you did, but increased the number of pitchers to 10,000. For this case, I get year-over-year r values of :
xbh = 0.0039
1b = -0.0080
So, essentially no correlation, which is what I was hoping for.

For the second study, I also used 10,000 pitchers. However, in this case each pitcher was assigned a random value of 1B and xbh. For 1B I gave a range of 0.18 to 0.22. For xbh I gave a range of 0.09 to 0.11. So, on a relative basis, these are the same ranges. The correlation coefficients here are:
1B = 0.46
xbh = 0.28
So, from here we can see that there is significantly less predictability in xbh rate, despite the fact that the relative variation in the two statistics is approximately the same.
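Erik's second study can be replicated with a small script. A sketch, with assumptions: 1,000 pitchers instead of 10,000 to keep the runtime down, uniform talent draws over his stated ranges, and a fixed seed:

```python
import random

random.seed(7)

N, BIP = 1000, 1000   # pitchers, balls in play per season

def binom(n, p):
    """Number of successes in n Bernoulli trials with probability p."""
    return sum(1 for _ in range(n) if random.random() < p)

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)
    return cov / var ** 0.5

# Each pitcher gets a fixed true rate, then plays two independent seasons at it.
p1b = [random.uniform(0.18, 0.22) for _ in range(N)]
pxbh = [random.uniform(0.09, 0.11) for _ in range(N)]
r_1b = pearson([binom(BIP, p) for p in p1b], [binom(BIP, p) for p in p1b])
r_xbh = pearson([binom(BIP, p) for p in pxbh], [binom(BIP, p) for p in pxbh])
```

The results should land near Erik's .46 and .28, give or take sampling noise from the smaller pitcher pool.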

Posted 1:52 p.m., August 6, 2003 (#38) - tangotiger
  Very very interesting!

A couple of things. First, what you are showing is that with a pitcher's singles and extra-base skills held consistent year-to-year, the correlation among a group of 10,000 pitchers with exactly 1000 PAs each was .46 for singles and .28 for doubles.

These figures are virtually IDENTICAL to what I have presented at the top of this page. That is, GIVEN that a pitcher has a set skill, the best year-over-year r that you can hope for is .46 and .28.

More specifically, the year-over-year r that I have presented is consistent with a pitcher having a skill where the range is between .18 and .22 for singles and .09 and .11 for doubles.

Am I reading this right?

What if you extend this to .15 and .25 for singles, and .05 to .10 for doubles?

Will the larger spread in talent among pitchers allow us to get an r to approach 1?

So, I guess what I'm saying is not that the low r is telling you that you've got little consistency, but rather that the low r is showing that you can only get little consistency, simply because the range of talent is so tight.

And that "tightness" is really what DIPS is all about.

This is tremendous stuff Erik! Keep it up.

Finally, can you also give the "r" for the out, the largest component of them all? I'm still not convinced. My guess is that the further you get from .500, the lower the "r". So, the "r" of the out (which is .2 from .500) should be slightly larger than the single's (which is .3 the other way from .500).

Posted 2:58 p.m., August 6, 2003 (#39) - Erik Allen
  Yeah, I hadn't noticed that above, but the numbers are eerily similar!

As to your previous post, I reran some numbers, and you are absolutely correct:
1B Range: 0.18 - 0.22: corr = 0.46
xbh range: 0.09 - 0.11: corr = 0.28
out range: 0.67 - 0.73: corr = 0.44

1B Range: 0.16 - 0.24: corr = 0.77
xbh range: 0.08 - 0.12: corr = 0.60
out range: 0.64 - 0.76: corr = 0.76

1B Range: 0.12 - 0.28: corr = 0.93
xbh range: 0.06 - 0.14: corr = 0.86
out range: 0.58 - 0.82: corr = 0.93

As you can see, as you increase the spread of ability, you increase the likelihood that the true ordering of abilities will prevail over the course of the season.

As you also predicted, the outs correlation was actually very close to the 1B correlation. However, I am not sure I am ready to cede this point. I say this because the %out probability is NOT independent of the other two probabilities. So, it is not technically a random variable. Does this affect things at all? I have no idea...oh, how I wish I was a statistician.
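Erik's table can also be predicted analytically: with true rates uniform on [lo, hi] and n binomial trials per season, the expected year-to-year r is the reliability ratio var(true) / (var(true) + p(1-p)/n). A sketch under exactly those assumptions:

```python
def expected_r(lo, hi, n=1000):
    """Expected year-to-year r for uniform true rates on [lo, hi], n trials/season."""
    p = (lo + hi) / 2
    var_true = (hi - lo) ** 2 / 12      # variance of a uniform distribution
    var_noise = p * (1 - p) / n         # within-season binomial variance
    return var_true / (var_true + var_noise)

r_1b = expected_r(0.18, 0.22)    # ~0.455 (Erik simulated 0.46)
r_xbh = expected_r(0.09, 0.11)   # ~0.270 (Erik simulated 0.28)
r_wide = expected_r(0.12, 0.28)  # ~0.930 (Erik simulated 0.93)

# The out rate is not an independent variable: out = 1 - 1B - xbh, so its
# true-talent variance is the sum of the other two (if those are independent).
var_out = (0.22 - 0.18) ** 2 / 12 + (0.11 - 0.09) ** 2 / 12
r_out = var_out / (var_out + 0.7 * 0.3 / 1000)   # ~0.443 (Erik simulated 0.44)
```

This also speaks to the independence worry at the end of #39: treating the out as the complement of the two hit rates reproduces the simulated .44 almost exactly.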

Posted 3:49 p.m., August 6, 2003 (#40) - tangotiger
  I think what we are prepared to say is that:
- given the spread of "true skill rates" of whatever metric you want, you can estimate the expected "r" year-to-year

- using the sample year-to-year results, you will get an "r" for those samples

COMPARING these two "r" is what establishes to the extent that you can say that a skill exists (in that metric).

So, we can easily have a hits/BIP with an r of .2 and a BB/PA with an r of .7 and in both cases we can say "yes, a pitcher's skill is perfectly represented in those metrics".

It would be good if the Primate statisticians spoke up at this point to add clarity and conviction to what we are saying.

Posted 5:10 p.m., August 6, 2003 (#41) - FJM
  As long as the ability to prevent hits on BIP is viewed as a skill WHICH DOES NOT CHANGE from year to year, your model is an accurate representation of the real world, and the correlation coefficient increases with the width of the range of abilities. But that assumes every 0.20 pitcher remains a 0.20 pitcher, every 0.18 pitcher stays right there, and so on. How realistic is that? Well, if the range of abilities is very narrow, then the chance of any pitcher greatly improving (or worsening) is very remote. But if the range is very wide, significant changes in year-to-year ability are certainly possible.

At the extreme, assume your range is very wide (say, 0.12-0.28) but the pitcher's ability in year 2 is completely independent of his ability in year 1. Then the correlation coefficient is 0 by definition. So you can get a small r in either of two ways: 1) very small differences in true ability among pitchers with a lot of random variation, or 2) large differences in true ability accompanied by large year-to-year variation in that ability for individual pitchers. DIPS assumes that reason #1 is THE reason for the low correlation. I strongly suspect that #2 plays a very significant, if secondary, role.

Posted 10:29 p.m., August 6, 2003 (#42) - tangotiger
  To recap, the year-to-year r is dependent on:
1 - how many pitchers in the sample
2 - how many PAs per pitcher in year 1
3 - how many PAs per pitcher in year 2
4 - how much spread in the true rates there are among pitchers (expressed probably as a standard deviation)
5 - possibly how close the true rate is to .5
6 - the true rate being the same in year 1 and year 2

Given all that, the biggest factor in the K "r" being the highest and the XBH "r" being the lowest may be entirely due to #4. That is, the "r" is not explaining #6 anywhere near as much as we think it is.

Someone please slap me awake... it seems that there's about 10,563 Primates that need to give RossCW an apology?!?!?

Posted 11:16 p.m., August 6, 2003 (#43) - David Smyth
  Very interesting. I'll have to read this thread in more detail to make sure I understand.

I don't know what #5 in Tango's post means.

Does all this suggest that, instead of projecting next year's H-HR by using a mix of K (r=.8) and BABIP (r=.2), maybe it would be better to simply use this year's H-HR (r=.4, perhaps)?

Posted 1:32 a.m., August 7, 2003 (#44) - jto
  Tango,
  So is this verifying that there is a small ability to prevent hits on balls in play, and that the low year-to-year correlation of BABIP can be explained by the "tightness" of the range in true abilities? Seems to me that this still supports DIPS, in that the ability is so small it can be left out of evaluations. What should we be apologizing to RossCW for? I'm not saying we shouldn't... I'm just asking if you could give more of a translation for the DIPS and statistical laymen among us. Thanks

Posted 7:31 a.m., August 7, 2003 (#45) - tangotiger
  I think, maybe, that it is simply the tightness of the h-hr / BIP (over a career) that is being explained, and not the "persistence" of ability, based on the "r".

For those of us hoping that "r" was trying to find the signal, that's not what it's doing. The h-hr / BIP is too tight to find a signal.

So, we should use a heavily regressed h-hr / BIP, but not for the reason of "lack of control".

I think.

Posted 8:31 a.m., August 7, 2003 (#46) - Erik Allen
  I agree with tango's post (#42) above. Let me restate it in my own words (to make sure we are on the same page), and then add a few thoughts of my own.

I think the basic lesson we can take from this discussion is: Year-to-year correlation coefficients depend on a lot of factors, including sample size, how often the event (e.g. hit) occurs, and the spread of talent. Therefore, a low correlation coefficient ON ITS OWN is not enough to say that a talent or persistence of ability does not exist. In fact, as the simple simulations I did above show, you can get a VERY low correlation coefficient even when a distinct talent is present.

In response to FJM: I agree that there may be constant changes in a pitcher's (or fielder's, or whomever's) ability to prevent hits on balls in play. However, as you say, we can't really separate those changes in ability from the fluctuations in BABIP that are caused by simple randomness. But this is really true of all baseball statistics. When a pitcher strikes out 12 batters in one game when he averages 7 K/9IP, we don't know if this was a random variation, or if the pitcher was really "on" that day. The question is: Can we create a model in which pitcher ability is fixed, and have that model describe the observed variability in BABIP? If you can, then the source of variability is really irrelevant.

Posted 10:53 a.m., August 7, 2003 (#47) - RossCW
  Therefore, a low correlation coefficient ON ITS OWN is not enough to say that a talent or persistence of ability does not exist.

Which is a point I have tried to make several times. But no one owes me an apology. I had a very strong hunch but lacked the statistical knowledge to identify the specific problem.

The question is: Can we create a model in which pitcher ability is fixed, and have that model describe the observed variability in BABIP? If you can, then the source of variability is really irrelevant.

Can you expand on this? I don't really understand whether you are reiterating the point about r's or saying something else.

Posted 11:12 a.m., August 7, 2003 (#48) - tangotiger
  I would think that you create a model where you have known fixed talents, with a range equivalent to what you think MLB has (however you do that, but you can try different reasonable scenarios). And figure out the year-to-year "r" based on this model, and the number of BIP these pitchers have. That essentially gives you the "upper boundary" of r, which may be something like .2 or .25 for hits on BIP.

If in actual life, the MLB r is .18, well then, that's pretty strong evidence of persistence, right?

I think (again).
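
Tango's model can be sketched in a few lines of Python. Everything here is an assumption for illustration (the pitcher count, the 500 BIP per season, the .012 talent spread), and the binomial noise is approximated as normal:

```python
import random, math

random.seed(1)

def year_to_year_r(n_pitchers=5000, bip=500, talent_sd=0.012, mean=0.281):
    """Give each pitcher a FIXED true BABIP, simulate two seasons,
    and return the correlation of the two observed BABIPs."""
    xs, ys = [], []
    for _ in range(n_pitchers):
        p = random.gauss(mean, talent_sd)        # fixed true talent
        noise = math.sqrt(p * (1 - p) / bip)     # binomial noise, normal approx.
        xs.append(random.gauss(p, noise))        # observed BABIP, year 1
        ys.append(random.gauss(p, noise))        # observed BABIP, year 2
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sx = math.sqrt(sum((v - mx) ** 2 for v in xs) / len(xs))
    sy = math.sqrt(sum((v - my) ** 2 for v in ys) / len(ys))
    return sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / (len(xs) * sx * sy)

r_spread = year_to_year_r()                # talents spread with sd .012
r_flat   = year_to_year_r(talent_sd=0.0)   # every pitcher identical
print(round(r_spread, 2), round(r_flat, 2))
```

With a .012 talent spread, r comes out near the analytic value sd²/(sd² + p(1-p)/BIP) ≈ .26; with no spread at all, it hovers near zero. The first number is the kind of "upper boundary" described above.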

Posted 11:43 a.m., August 7, 2003 (#49) - Erik Allen
  In response to RossCW (#47)

I apologize for the misunderstanding...I did not explain my point about the model very well. Here is what I meant:

Pitcher BABIP is variable...it varies from game to game, and year-to-year. As a poster above mentioned, this variability has some element of luck, or random variation. However, it might also be variable because a pitcher's skill at preventing BIP changes from game-to-game or from year-to-year. If BABIP "skill" changes from game-to-game, this we would probably call "streakiness." If BABIP "skill" varies from year-to-year, we could call this "development" or "aging." My point was simply that we can't really separate changes in skill from random variation. To simplify the universe, we simply have to start by assuming that a picher/fielder's skill level is fixed over the course of a multiyear period. If the variability that we see from year-to-year can be explained using this assumption, then we don't really need to worry about the possibility of streakiness.

I think that this is basically what Tango is saying in post 48 as well.

Posted 11:52 a.m., August 7, 2003 (#50) - Erik Allen
  Tango,

I don't know what capabilities you have in terms of the database you are working from, but I was wondering if it would be possible to get data for pitcher seasons broken into groups by number of balls in play?

For example, could you give me a list of all pitcher seasons over the past 20 years where the pitchers had between 100 and 200 balls in play, and so on? I think we could start to understand how much variance there is in BABIP as you increase the number of balls in play.

For the idea I had, I would need the number of pitchers in each group, the average BABIP for the group, and the standard deviation on BABIP for the group.

This is pretty exciting. I think there are some real opportunities to make progress here.

Posted 12:11 p.m., August 7, 2003 (#51) - tangotiger
  Erik, that should be no problem, but there are a couple of issues.

If you've got a pitcher with 800 BIP, chances are that he would be of a certain quality. So, you shouldn't expect a .350 BABIP in this class, based on selective sampling.

If you've got a pitcher with 100 BIP, chances are that the reverse would happen... you'll get lots of .400 BABIP, by luck, and the manager's had enough, and won't put him out there.

That is, I'd expect to find the mean to be different among the classes, and the distribution around them might be skewed based on selective sampling.

I don't know how the spread would be affected.

Send me an email, and I'll send you the file. Unless you need something else, I will give you a file that has:
BIP,1B,2B,3B
for every pitcher by year, 1972-1992, min 100 BIP, and I'll let you select the necessary classes.

Posted 12:11 p.m., August 7, 2003 (#52) - tangotiger(e-mail)
  My email address.

Posted 1:01 p.m., August 7, 2003 (#53) - Erik Allen
  Hmmm...I hadn't considered the possibility of selective sampling, but you make an excellent point. It should be interesting to see what happens to the spread.

Posted 1:13 p.m., August 7, 2003 (#54) - tangotiger (homepage)
  Erik, if you haven't seen it, click on the above link.
It is the career records of pitchers, relative to their teammates, broken down by career BIP classes.

The skew that exists is very apparent. We just don't know the reason (selective sampling, ability, or both).

Posted 2:16 p.m., August 7, 2003 (#55) - Erik Allen
  Thanks, I had not seen that article.

Just to confirm that you see the same effect on a season-by-season basis, here are the results from the file you sent me:
#BIP BABIP
100-199 0.291
200-299 0.284
300-399 0.284
400-499 0.282
500-599 0.282
600-699 0.278
700-799 0.275
800-899 0.272
900-999 0.268

As you say, either selective sampling is at work, or there is a real difference in ability.

Posted 3:17 p.m., August 7, 2003 (#56) - tangotiger
  Virtually exactly what we expected if we suspected selective sampling.

A couple of points: you should probably at least adjust for the year-to-year league changes in BABIP. Park does play a role, but it's not like pitchers at Dodger Stadium will get more BIP per season than at Fenway. We kinda expect to have 1 pitcher on each team with 900 BIP, one each with 750 BIP, etc.

What's interesting is that after 600 BIP, you are talking about guys with at least 30 starts. So, it's not like the manager will have suffered with a pitcher for 30 starts and then pull the plug on him. Essentially, selective sampling should not be an issue with 600+ BIP.

Therefore, the effect we see from the 600 to 999 classes would probably be due to skill more than anything.

From 200 to 600, it's pretty stable, and that's probably also due to great relievers balancing out the starters who couldn't cut it after bad luck.

The impact that we are talking about is that the great pitchers will have a BABIP of .272 against the league average of .282. That's .01 hits / BIP, or 7 hits per 700 BIP. That's really what we'll be talking about, after the dust settles.

The range of skill is so tight among MLB pitchers that there's little to differentiate at this level (with the metrics currently at our disposal).

The conclusions that DIPS points to are still supported; it's just that the statistical justification using the "r" is not applicable for those conclusions (at least to the extent that we first thought).

Posted 3:27 p.m., August 7, 2003 (#57) - tangotiger
  Tippett brought this up, and since no one else is going to say anything, I will.

The idea to use the team $H (or BABIP) in place of the player $H is severely flawed (though I have used this process many many times).

Because of what we now know about sample sizes affecting the correlation, the team $H probably works better than the player $H simply because it is based on a much larger sample (4000 BIP to a pitcher's 500 BIP).

In fact, I would bet that if you randomly took any team $H, and compared that to the next year's pitcher $H, that it would do better than the current year's pitcher $H.

Therefore, if you want to do this "substitution" process to kind of mimic your team's fielders, you should find a pitcher on your team with a similar # of BIP. So, if you've got Steve Rogers with 700 BIP and a $H of .270 and Charlie Lea, with 650 BIP and a $H of .285, then use Lea's $H as your control. I think that would work out better.

If DIPS holds, then we'd expect that the pitcher and his control will have an equal "r" when compared to the pitcher's next year's $H.

If someone wants to do this, you should control for
- both pitchers being on the same team in year x
....(and at least 600 BIP each to kind of circumvent selective sampling issues,
.....and have the number of BIP within say 10% of each other),
- the pitcher being studied to also be on the same team in year x+1 (and also at least 600 BIP).

Anyone want to try?
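
The matching step above might be coded as a simple filter. This is only a sketch: the record fields ('name', 'team', 'bip') are hypothetical, taking the closest-BIP qualifier is my own tie-break, and the year-x / year-x+1 team checks would wrap around this:

```python
def find_control(pitcher, candidates):
    """Pick a same-team control pitcher per the criteria above (sketch).
    Both pitchers need 600+ BIP, and BIP counts within 10% of each other."""
    best = None
    for c in candidates:
        if c["name"] == pitcher["name"] or c["team"] != pitcher["team"]:
            continue
        if pitcher["bip"] < 600 or c["bip"] < 600:        # dodge selective sampling
            continue
        if abs(c["bip"] - pitcher["bip"]) > 0.10 * pitcher["bip"]:  # within 10%
            continue
        # among the qualifiers, keep the closest BIP match
        if best is None or abs(c["bip"] - pitcher["bip"]) < abs(best["bip"] - pitcher["bip"]):
            best = c
    return best

rogers = {"name": "Rogers", "team": "MON", "bip": 700}
lea    = {"name": "Lea",    "team": "MON", "bip": 650}
print(find_control(rogers, [rogers, lea])["name"])  # → Lea
```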

Posted 6:09 p.m., August 7, 2003 (#58) - Erik Allen
  Well, I have some preliminary results, and they are promising (to me at least), although I don't know how much stock to put into them.

First a little background: As tango points out in his previous posts (56,57), in the range of 200-800 BIP, the BABIP rate is very tight (0.275-0.284). I am therefore making the approximation that all the pitchers that appear in this sample are drawn from a single distribution of talent at preventing hits on balls in play. I am furthermore assuming that this distribution of talent is normally distributed (i.e. bell curve shaped) about 0.281.

Here are the statistics I get for all pitcher seasons between 200 BIP and 799 BIP (based on data provided by tango):
# of seasons: 4389
average BABIP: 0.281
Standard Deviation of BABIP: 0.027

The standard deviation is a measure of the spread of the data. Basically, we can say that about 68% of all seasons should be within +/- 1 SD of average (0.254 - 0.308) and 95% will be within 2 SD. Nothing exciting here, this has all been done before.

The question that we cannot answer from the basic analysis above is: what is the standard deviation of "true" talent? For example, do all pitchers simply have the same talent level of 0.281 BABIP? Or, is there some spread to pitcher talent? What is the magnitude of this spread?

To answer these questions, we have to account for the number of trials (i.e. the number of balls in play). So, I have broken down the pitching seasons by balls in play into groups of 100. Listed below are the number of seasons in each group, and the standard deviation of the group:
#BIP #seasons STDEV
200-299 1446 0.032
300-399 812 0.0268
400-499 592 0.0245
500-599 507 0.0221
600-699 579 0.0210
700-799 454 0.0204

From the above table, you can see a clearly decreasing trend in the standard deviation of BABIP as you increase the number of BIP. And, intuitively, we can agree with this idea. After all, in a small number of trials, any number of fluky things can happen, including having a 0.400 BABIP or a 0.150 BABIP. As you increase the number of chances, the likelihood of a really fluky season decreases.

We now have standard deviations of the OBSERVED data broken down by number of balls in play. However, we also know that this observed standard deviation is not equal to the standard deviation of the TRUE talent level. For example, if all pitchers have the same inherent skill level (BABIP=0.281) the stdev of the true distribution is 0. The observed stdev will be something greater than zero.

To figure out what the true standard deviation is that matches the data, we can run a simulation. In this simulation, I set the true standard deviation of the group of pitchers, and measure the output observed standard deviation. Then, I tinker with the set value of the true standard deviation until the output standard deviation I obtain is equal to the observed standard deviation of the group. I can do this for the various numbers of balls in play.

This is already getting long, and I am probably rambling incoherently, so let me simply get to the data. The table below lists the number of balls in play range, and the TRUE standard deviation that would lead to the OBSERVED standard deviation given in tango's data.

#BIP TRUE STDEV
200-299 0.014
300-399 0.014
400-499 0.012
500-599 0.012
600-699 0.012
700-799 0.012

I was ecstatic, to say the least. What I see above is a remarkably consistent picture of pitcher ability. It seems that, as a rough estimate, we can say that pitcher abilities are normally distributed about BABIP 0.281 with a standard deviation of 0.012 or so.

Roughly 2/3 of all pitchers should have a TRUE BABIP rate of 0.269-0.293. Roughly 95% of pitchers should have a true BABIP rate of 0.257-0.305. If this stands up, it is useful because it means that when a pitcher has a season with a 0.250BABIP, we can hypothetically give an estimate of his TRUE BABIP rate.

There are a ton of holes that can be poked in this, being as rough a calculation as it is, and I welcome any and all comments.
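
Erik's tinker-until-it-matches procedure can be written as a bisection on the true spread. The sketch below makes the same assumptions as the post (normal talent distribution, binomial noise approximated as normal) plus one of mine: using the 550-BIP midpoint for the 500-599 bucket.

```python
import random, math

random.seed(7)

def observed_sd(true_sd, bip, n=20000, mean=0.281):
    """Observed one-season BABIP spread when true talents have spread true_sd."""
    obs = []
    for _ in range(n):
        p = random.gauss(mean, true_sd)                            # true talent
        obs.append(random.gauss(p, math.sqrt(p * (1 - p) / bip)))  # one season
    m = sum(obs) / n
    return math.sqrt(sum((v - m) ** 2 for v in obs) / n)

def true_sd_matching(target_obs_sd, bip):
    """Bisect on the true spread until the simulated observed spread matches."""
    lo, hi = 0.0, 0.05
    for _ in range(20):
        mid = (lo + hi) / 2
        if observed_sd(mid, bip) < target_obs_sd:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# The 500-599 bucket: observed SD 0.0221, using ~550 BIP as the midpoint
est = true_sd_matching(0.0221, 550)
print(round(est, 3))  # lands near .011, close to the .012 quoted above
```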

Posted 6:31 p.m., August 7, 2003 (#59) - tangotiger
  #BIP #seasons observed STDEV..... expected STDEV if all random
200-299 1446 0.032... sqrt(.28*.72/250)= .028
300-399 812 0.0268... .024
400-499 592 0.0245... .023
500-599 507 0.0221... .019
600-699 579 0.0210... .018
700-799 454 0.0204... .016

That is, if all the pitchers in the 500-599 group were all just pitchers of the same ability, we'd expect 1 SD = .019, while we observe .0221.

However, as Erik is showing us, to match the observed spread, the pitchers' true talents cannot all be the same (even though my .019 and .022 look so close). We are showing that the standard deviation of the true ability must be .012, essentially across all samples. This is a great discovery!!

And this .012 is much higher than I would have expected. This means that 95% of the pitchers are within +/-.025 hits / BIP. At 700 BIP, that works out to +/- 17 hits. This is more than double what I would have expected.

The problem is that even though we have this huge gap, we just can't measure it on an individual pitcher's basis with much reliability (until his career is almost over).

Lots to think about....

Posted 6:44 p.m., August 7, 2003 (#60) - Erik Allen
  Tango,

You mention that the stdev of 0.012 is twice what you would have expected. Can you explain what that expectation was based on? Perhaps I am missing something in my analysis.

Posted 7:37 p.m., August 7, 2003 (#61) - tangotiger (homepage)
  2 things I forgot about: park and fielding.

Go to my site to get park factors. Divide by 2 to simulate half season at a park. Maybe randomly assign a park to your simulated pitchers.

Also assume about 1 sd = .008 hits / BIP for a team of fielders.

Run your stuff again. I think that'll cut your numbers down to half what they are showing.

Posted 8:22 p.m., August 7, 2003 (#62) - jay
  great stuff guys...keep it up!

Posted 9:17 p.m., August 7, 2003 (#63) - Erik Allen
  D'oh! Defense and park factors are always getting in the way of beautifully simple theories.

Tango, your comments in 61 make me realize that I was getting ahead of myself. All the simulation above tells us is that we can match the observed experimental distribution if pitcher BABIP rates are normally distributed with a stdev of 0.012. We have not yet made any claims as to why this distribution of true BABIP rates exists...is it due to the pitcher, the defense, or the park? This is obviously a key question in predicting abilities going forward.

However, I am not sure that I agree with the modifications you mention... I think one of the ideas that you have put forward very clearly during this debate is that we need to reconsider how important the defense and pitcher contributions are to BABIP. If we assign 0.008 to the defense and the rest to the pitching, it seems to me that we have just introduced our own biases into the equation.

I wonder if you have given any more thought to the question of comparing a theoretical correlation coefficient to an experimental coefficient, as a basis of predicting control?

By the way, thank you so much for making that data file...it was amazing how quickly you were able to generate it!

Posted 10:14 p.m., August 7, 2003 (#64) - Alan Jordan(e-mail)
  Warning retread -

Tango invited me to comment on this thread last night. It took a couple of hours to read through all the posts, and check a few of the simulations.

Here is another way of looking at the test-retest correlation in terms of what it's supposed to measure (player's ability vs. change in ability) and how the two should be related to the Rsqr. Of course R is simply the square root of Rsqr.

Part 1
Rsqr = Variance of player's ability / (Variance of player's ability + Variance of change in ability from year to year).

The above assumes that player’s ability and change in ability from year to year are independent (unobservable ability, not observable performance). It seems reasonable at the moment and I can always generalize it if need be.

I can't give you a mathematical proof, but I would start it by assuming that change in ability is the error and ability is the model. I can give you this for those of you who are programmers and have some stats software.

Step1 Generate a variable called X with a variance of 4 (don’t worry about the distribution). Generate in the same step a variable called err with a variance of 9. Create a variable called y as the sum of X and Err.

Step 2. Calculate the Rsqr between X and Y and it will be close to .3.
.3=4/(4+9)
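
The two steps above, sketched in Python (stdlib only; the sample size is arbitrary). Note 4/(4+9) is ≈ .31, i.e. "close to .3":

```python
import random, math

random.seed(3)

n = 200_000
x   = [random.gauss(0, 2) for _ in range(n)]   # Step 1: X with variance 4
err = [random.gauss(0, 3) for _ in range(n)]   #         err with variance 9
y   = [a + e for a, e in zip(x, err)]          #         y = X + err

# Step 2: Rsqr between X and Y
mx, my = sum(x) / n, sum(y) / n
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
sx  = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
sy  = math.sqrt(sum((b - my) ** 2 for b in y) / n)
rsqr = (cov / (sx * sy)) ** 2
print(round(rsqr, 2))  # ≈ 4 / (4 + 9) ≈ 0.31
```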

Part 2
What does this mean for FJM's denominator hypothesis? An event ratio with a small probability (or a probability far above .5) will have a small variance, because in binary data the probability and the variance are related: Variance of p = p*(1-p). That small variance will cause a small r. Since singles have a higher p, they should have a higher variance and hence a higher r.

Tango – what is the correlation of the logit of these events for each year (natural log of the odds)? Is that doable for this weekend? If the variance is proportional to the probability of the event, then transforming the data will remove that and get us a better picture. I.e., the r for 1B/PA may be higher than the rest of the other r's strictly because it has a higher probability.

Erik Allen -

You said -
The standard deviation is given by
STD = sqrt(n*p*(1-p))

This is wrong. STD = sqrt(P*(1-P)). N doesn't come into play.

The standard deviations are:
1BSTD= 0.4
xbhSTD= 0.3

but your conclusion is correct:

“If you happen to locate a statistic that displays a HIGHER year-to-year correlation, …, then this would seem to imply that the differences in player ability outweigh the variability of the statistic.”

Tango-

Anyway, for these 20 pitchers, here are their year-to-year r
2b: .18, 1b: .47, out: .11
Wouldn't we have expected the out, with the highest numerator, to have the highest r, based on your previous explanation?
Don't read too much into these; you have a sample size of 20. If you don't have a sample size of 20, then you probably did something wrong. My hunch is that even with 1000 pitchers of equal ability, your correlations will be insignificantly different from zero.

Erik Allen-
Corr = sum over i [(x_i-x_avg)*(y_i-y_avg)]
You subtracted the mean, but forgot to divide by the standard deviation. The Corr is a covariance of standardized variables. What you have is a covariance of centered variables.
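
A tiny numeric illustration of that distinction, with made-up numbers: dividing by the standard deviations is exactly what turns a covariance into a correlation.

```python
import math

def standardize(v):
    m = sum(v) / len(v)
    s = math.sqrt(sum((x - m) ** 2 for x in v) / len(v))
    return [(x - m) / s for x in v]

def corr(x, y):
    """Correlation = covariance of the STANDARDIZED variables.
    Centering alone (subtracting the means) gives only a covariance."""
    zx, zy = standardize(x), standardize(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(zx)

print(round(corr([1, 2, 3, 4], [2, 4, 5, 9]), 3))  # → 0.965
```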

“In your first simulation, all 20 pitchers should have the same ability. Therefore, if pitcherX were ABOVE average one year, we should not expect him to be ABOVE average the second year, and I would think that corr=0 for a sufficiently large sample.”

I think that’s a good insight.

“range of 0.09 to 0.11. So, on a relative basis, these are the same ranges. The correlation coefficients here are:
1B = 0.46
xbh = 0.28
So, from here we can see that there is significantly less predictability in xbh rate, despite the fact that the relative variation in the statistics is approximately the same.”

I ran your simulation and got about the same numbers you did, but I can't find a transformation that will equalize the r's. I tried logit, ln and square root. Even a non-parametric correlation didn't do the trick. Beats me.

Tango-
You said:

“Will the larger spread in talent among pitchers allow us to get an r to approach 1?”
Absolutely.

FJM-
You said:
“But that assumes every 0.20 pitcher remains a 0.20 pitcher, every 0.18 pitcher stays right there, and so on. How realistic is that? Well, if the range of abilities is very narrow, then the chance of any pitcher greatly improving (or worsening)is very remote. But if the range is very wide, significant changes in year-to-year ability are certainly possible.”
No, the abilities and changes in abilities are assumed to be independent.

“So you can get a small r in either of 2 ways: 1)very small differences in true ability among pitchers with a lot of random variation, or 2)large differences in true ability accompanied by large year-to-year variation in that ability for individual pitchers”
Exactly

Tango-
You said:
“ To recap, the year-to-year r is dependent on:
1 - how many pitchers in the sample
2 - how many PAs per pitcher in year 1
3 - how many PAs per pitcher in year 2
4 - how much spread in the true rates there are among pitchers (expressed probably as a standard deviation)
5 - possibly how close the true rate is to .5
6 - the true rate being the same in year 1 and year 2”
All are true but number 1. The number of pitchers affects the standard error, i.e. the precision of our estimate. With only 20 pitchers, our estimate of r might be too high or too low, but r itself remains unchanged.

Posted 10:38 a.m., August 8, 2003 (#65) - tangotiger (homepage)
  Erik,

To simulate park, that's easy enough. Just go to the above link. We see that the stdev for park is .0085. Since they play half their games at home, the "seasonal" park adjustment would be .004.

We definitely have to simulate fielding, but the question is "how"? If I look at team-level UZR, on a year-by-year basis (n=120 over 4 years), the stdev is about .0100 (but you need to regress somewhat). If I take it on a multi-year basis (1999-2002, n=30), the stdev is .0070. Since teams do turnover, I think the answer lies somewhere in-between, I'd guess. So, I'd make that .008. (I'd guess that if you even just used ZR, or any other measure, you'll get similar results.)

If you were to run your simulation where you set the standard deviation of the park to .004 and the fielders to .008, we can figure out what's left over for the pitchers.

Now, you can try running your sim so that fielding is set to .006 or .010 or anything (reasonable) you want really. So, you can say that "if fielding stdev is .006, pitching stdev is .007... if fielding stdev is .008, pitching stdev is .005", or something along those lines.

This is really exciting! We can finally come up with the proper "split" between fielding, pitching, and park.

My original guess would have been a 4/3/2 split between fielding/pitching/park.
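
If the three sources are assumed independent, their variances add in quadrature, so the numbers in this post already imply a pitching share by subtraction. This is only a back-of-envelope under that independence assumption:

```python
import math

total_sd = 0.012  # spread of true BABIP "talent" found earlier in the thread
field_sd = 0.008  # estimate above for a team of fielders
park_sd  = 0.004  # seasonal park effect

# Independent components => variances add, so the pitching share
# is whatever variance is left over:
pitch_sd = math.sqrt(total_sd ** 2 - field_sd ** 2 - park_sd ** 2)
print(round(pitch_sd, 4))  # → 0.008
```

With these particular inputs the leftover pitching spread happens to equal the fielding spread (.008), i.e. roughly a 2/2/1 split, though the sim sweep described above is the more careful way to settle it.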

Posted 1:44 p.m., August 8, 2003 (#66) - Erik Allen
  Okay, I see where you are going now.

I have a few questions/comments, all of which can be dealt with (I think).

1) Both the park effect and the defensive deviations you mention are the observed standard deviations, correct? If so, I would think that the measured stdev is larger than the true stdev as in the general case. I think we can account for this somehow.
2) I just read the UZR Primer by Mitchel Lichtman. However, he focuses on individual performance, not team performance. Do you have a good article relating to team UZR?
3) It appears that UZR ignores certain outcomes (pop flys?) which would not give credit to pitchers who were able to induce lots of pop flys. I am worried this (or other effects) might give more credit to the defense than is due.
4) Has anyone done a comparison of year-to-year correlation of pitchers who remain with the same team, versus pitchers who change teams? This seems like it might provide some insight into how much control a pitcher has.

Posted 2:15 p.m., August 8, 2003 (#67) - tangotiger
  1) Both the park effect and the defensive deviations you mention are the observed standard deviations, correct? If so, I would think that the measured stdev is larger than the true stdev as in the general case. I think we can account for this somehow.

The link I have for the park effects on DER is over a 17year period, or about 80,000 BIP per park. Feel free to regress whatever your sim would say to regress. That is, run your sim giving each team 80,000 BIP, and try to match the observed. I have to believe that you won't regress more than 5%.

2) I just read the UZR Primer by Mitchel Lichtman. However, he focuses on individual performance, not team performance. Do you have a good article relating to team UZR?

On my site, I have MGL's file by player, pos, team, year. The results I just published were based on this data.

3) It appears that UZR ignores certain outcomes (pop flys?) which would not give credit to pitchers who were able to induce lots of pop flys. I am worried this (or other effects) might give more credit to the defense than is due.

That's a fair point. For every one ground out (as opposed to ground ball), there are 0.7 flyouts and 0.3 line outs and pop outs (about evenly distributed). Again, if you want to set aside a certain percentage of BIP as fielder-independent, that's a good idea too.

4) Has anyone done a comparison of year-to-year correlation of pitchers who remain with the same team, versus pitchers who change teams? This seems like it might provide some insight into how much control a pitcher has.

Yes, and I think the Tippett article also examined that. If I remember, he said the year-to-year r of pitchers who switched teams was .09.

Posted 11:20 a.m., August 9, 2003 (#68) - Erik Allen(e-mail)
  Okay, I have run a few simulations to try and tease out some of the park, defense, and pitching dependence. I will break this info into different posts, to avoid excessive length.

Posted 11:42 a.m., August 9, 2003 (#69) - Erik Allen
  Before getting into more complicated simulations, I thought it would be appropriate to first look at some "extreme" cases.

The first question one might ask is: Is the pitcher ENTIRELY responsible? The answer is almost certainly no, but it might be instructive to see what kind of results you would expect if such was the case.

What I did in this set of simulations, then, was to randomly assign 10,000 pitchers a BABIP skill level (for example, 0.281, or 0.250, etc.). These skill levels are normally distributed about 0.281 with a standard deviation of 0.012 (as found previously to fit the data). We assume that this BABIP level is ALWAYS their true level. Then, I simulate 2 separate seasons, and record an OBSERVED BABIP level for each pitcher each season (this would correspond to their major league performance). I then measure the correlation coefficient for the year-over-year data, and compare it to the correlation coefficient tango found in his study (0.15, see post 6).

Okay, wordy, I know, so let's get to the data: In the data file tango sent me, there were 4389 pitcher seasons with between 200 and 800 BIP. The average number of BIP was around 430. Therefore, I let each pitcher have 430 BIP in each season.

#BIP; 430 for both seasons, for all pitchers. r = 0.24

As we would expect, the correlation coefficient is too large. One modification we could make to change the outcome slightly would be to assign different pitchers different numbers of plate appearances, to more closely reflect reality. When I do this (e-mail me if you want more methodology), I get r = 0.21. Still too large.

The "Well, duh!" conclusion, is that pitchers BABIP talent does not lie solely with the pitcher.

Posted 11:55 a.m., August 9, 2003 (#70) - Erik Allen (homepage)
  A second extreme case would be to ask if the data can be explained solely on the basis of park factors. That is, the pitching has no influence, the defense has no influence, only the effect of the park determines the BABIP rate.

Tango has a list of BABIP park factors at his website (see homepage link). You divide these factors by 2 to get a team's park effect over the course of a season. The standard deviation of this distribution is 0.004.

In 2002, the average team had around 4550 BIP over the course of a season. Using the same methodology as above, I assign each team a BABIP level based on a normal distribution with standard deviation 0.004.

The correlation coefficient for this case is r = 0.25.
The year-to-year correlation coefficient on a team level is more like r = 0.6. So, clearly, the park is not the only factor either.

Posted 12:15 p.m., August 9, 2003 (#71) - Erik Allen(e-mail) (homepage)
  In Tippett's article on BABIP (see homepage link), he finds that the correlation coefficient for pitchers, _relative to their team_, is 0.09. Now, I am starting to get on shaky ground here, but if I assume that all pitchers are affected in the same way by a given defense or a given park (not entirely true obviously), then we can view BABIP relative to the team as a measure of pitcher ability.

To do the simulation, I assume that pitcher ability, relative to their team environment, is normally distributed. I try different standard deviations of this distribution, and measure the resulting correlation coefficient:

stdev_talent r
0.006 0.061
0.007 0.082
0.008 0.11
0.010 0.16
0.012 0.21

Based on the chart above, it appears that a standard deviation of talent of around 0.007 to 0.008 would be appropriate.
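
For what it's worth, there is also a closed-form approximation for this table: with fixed talents and independent binomial noise, r ≈ sd² / (sd² + p(1-p)/BIP). It runs a touch above the simulated values in the chart, but it points to the same neighborhood. The 430-BIP average is the only input I've assumed:

```python
import math

p, bip = 0.281, 430
noise_var = p * (1 - p) / bip  # binomial noise variance per season

for sd in (0.006, 0.007, 0.008, 0.010, 0.012):
    r = sd ** 2 / (sd ** 2 + noise_var)   # expected year-to-year correlation
    print(sd, round(r, 3))

# Inverting for Tippett's relative-to-team r of 0.09:
sd_implied = math.sqrt(0.09 / (1 - 0.09) * noise_var)
print(round(sd_implied, 4))  # ≈ 0.0068
```

Solving for r = 0.09 gives a talent spread near .007, consistent with the .007-.008 read off the chart.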

Posted 12:26 p.m., August 9, 2003 (#72) - Erik Allen(e-mail)
  I wanted to write one more post concerning some ideas for future research, most of which I would not know how to do:

1. One study that could potentially confirm this work would be to study the year-to-year correlation of pitchers who had between 200-800 BIP and remained on the same team both years. For this group of pitchers, one can conceivably imagine that the talent level of (pitcher+defense+park) would be fairly consistent year-over-year. Would you get a correlation coefficient closer to 0.21 for this group (as in post 69)?
2. Look closer at UZR to determine what an appropriate defensive stdev is (this I think I can do, after some more thought).
3. What year-over-year r do you get for pitchers that have changed teams (not sure if there is enough data for this.) Tango mentioned that Tippett addressed this, but I was not able to find it mentioned.

Posted 12:27 p.m., August 9, 2003 (#73) - Patriot
  As you allude to, the idea that a defense would affect all the team's pitchers evenly is not really one that we can know for sure. I would guess that there would be large differences. A team with good outfielders will be better for a flyball pitcher. But what if they have Jeter and Baerga playing SS and 2B? The groundball pitchers are going to suffer. I have no problem believing that there is some pitcher influence on $H, but I don't think the team correlations prove this.

Posted 12:46 p.m., August 9, 2003 (#74) - Erik Allen
  Patriot, I think you are correct in your assessment. In retrospect, post 71 hasn't really proved much, because you still have to make some key assumptions regarding the interaction of pitching and defense.

I don't really have a ton of good ideas beyond it, however.

Posted 3:43 p.m., August 9, 2003 (#75) - tangotiger
  I would say that the stdev for team fielding would be .008, as I noted earlier.

I suppose we can break down UZR by "IF" and "OF" and get stdev by that level, as an approximation for "GB" and "FB".

Then, we can use the GB and FB rates of the pitchers to figure out the extent to which the IF or OF is impacting them. (Sid Fernandez would be impacted by the OF or FB stdev more than the IF or GB stdev. So, you give Sid and Gooden et al the same OF stdev, but that stdev would apply to Sid more than anyone else.)

I'll guess that the stdev for GB and FB rates is .04. Virtually all pitchers are within .12 of the league average.

Posted 7:54 p.m., August 10, 2003 (#76) - FJM
  The GB/FB distinction is important for assessing the impact of fielding on a pitcher's BABIP. But no less important than separating lefties from righties, at least among the starters. In 2002 Arizona's 3 RHP starters saw LHB's in 47% of their BFP, so the fielders were about equally tested. In contrast, Brian Anderson saw LHB's only 26% of the time. And Randy Johnson? Only 15%.

Ironically, both Anderson and Johnson had higher BABIP's against the lefties than the righties. In Randy's case it was almost a wash (.295 to .291). But Brian was hammered by lefties at a .323 clip, compared to .288 by the righties. It would be interesting to know how much of that disparity was attributable to his fielders.

Posted 8:54 p.m., August 10, 2003 (#77) - Tangotiger
  Ok, how about we split up the fielding factors by team by position.

Then, give each pitcher his own ball distribution.

THEN, you can figure out the effect the fielding has on the pitcher.

Posted 10:54 a.m., August 11, 2003 (#78) - tangotiger
  Let me take a few steps back. We're almost to the point that if we start doing all this (accounting for fielding by position and accounting for ball distribution by pitcher), we are really just doing UZR and PZR, and therefore, no need to do this sim analysis.

Therefore, I suggest that to proceed in baby steps, that we assume that the pitchers have the same ball distribution, and that the fielders on the same team are equals as defenders.

Once we get those results in, we can start adding layers like taking into account ball distribution, and individualized fielding.

Posted 1:18 p.m., August 11, 2003 (#79) - FJM
  I didn't make my point as clear as I should have. I don't think it is necessary to take the ball distribution down to the individual pitcher level. All 3 RHP's faced 47% LHB's, so the only important distinctions among them are GB/FB and Power/Finesse. On the other hand, not only did the D'backs' lefties see a lot less of LHB's, the mix was very different between the two of them. Even here, I think the GB/FB and Power/Finesse splits would probably be enough, although the Big Unit really should be in a class all by himself. But the LHP/RHP split is fundamental.

Of course if you start from the assumption that all fielders have equal ability, then nothing else matters. But that assumption is so far removed from reality that I'm afraid it will significantly understate the standard deviation attributable to fielding.

On the other hand, as with pitchers, I don't think you need to simulate separate fielding ratings for each position on the team, much less for every individual at each position. The left side of the infield could be rated as a unit. The right side could be treated similarly as far as assists are concerned, although the first baseman's ability to turn bad throws into outs might require a separate rating. The outfield could get by with 2 ratings, a LF-CF rating and a CF-RF rating. Finally, I don't think you need to rate the catcher at all for this purpose. The difference between a great catcher and an average one (or even a bad one) generally comes down to some combination of stolen base and wild pitch/passed ball prevention, neither of which affects BABIP. (Of course, pitch selection is part of it too, but that is generally credited to --- or blamed on --- the pitcher.)

Posted 2:28 p.m., August 11, 2003 (#80) - tangotiger
  Please note that I meant that each fielder on the same team would be "equals", but that each team of fielders would follow the .008 standard deviation that UZR says it is.

Like I said, start off with baby steps, and work your way up.

Posted 10:15 a.m., August 12, 2003 (#81) - Erik Allen
  Sorry for the delay since my last post...I had to do some "real" work.

I ran the simulations that Tango suggested. That is, I introduced a random, normally distributed defensive factor for each pitcher. Tango set the standard deviation of the defensive contribution at 0.008. However, since I didn't know the exact basis for this number, I ran the simulation under 2 assumptions:

Case 1:
Assumptions:
1. 0.008 is the _observed_ stdev of defensive talent, AFTER ACCOUNTING FOR PARK EFFECTS.
2. The talent of a defense is independent of the park they play in (i.e. the park effect and the defensive ability of the team are independent variables).

In Case 1, we need to determine the true standard deviation of defensive ability, since the observed standard deviation is larger than the true standard deviation. To do so, I ran my simulation at different levels of true standard deviation, and measured the output stdev (each team was given 4550 BIP). I get a true standard deviation of 0.0045.

After doing this, I can compute the true standard deviation of pitcher ability. I use the same tactic as in previous posts, changing the true stdev to match the observed stdev for different levels of BIP. The stdev of pitchers in Case 1 is 0.010. Table 1 presents some data I get from such an analysis:

#BIP Simulation_stdev Real_life_stdev
250 0.0308 0.0321
350 0.0266 0.0269
450 0.0241 0.0245
550 0.0225 0.0221
650 0.0211 0.0210
750 0.0202 0.0204

As you can see, the simulation stdev matches the "real life" stdev in every case except for the pitchers in the 250 BIP range. This is a problem I have been having fairly consistently. I think it could be explained by a number of factors:
1. 100 BIP is too wide a range to use at such a low number of BIP
2. The range of talent for pitchers in this group is larger

Anyway, for case 1, we see the pitching/defense/park breakdown is 0.010/0.0045/0.004

P.S. I also calculated a correlation coefficient as described previously (pitchers assigned 200-800 BIPs according to the major league distribution). I get r = 0.20. Still too high (as expected)
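The Case 1 table can also be sanity-checked analytically: with independent talent and binomial noise, observed variance ≈ true variance + p(1-p)/n. A minimal sketch, assuming a league-average BABIP of about .29 (a number not stated in the thread):

```python
import math

p = 0.29  # assumed league-average BABIP; my assumption, not from the thread
# Case 1 true-talent spreads: pitcher .010, defense .0045, park .004
sigma_true = math.sqrt(0.010**2 + 0.0045**2 + 0.004**2)

for n in (250, 350, 450, 550, 650, 750):
    binom_var = p * (1 - p) / n               # binomial sampling noise
    observed = math.sqrt(binom_var + sigma_true**2)
    print(n, round(observed, 4))
```

For n = 650 this gives about .021, right in line with the simulated .0211 in the table above.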

Case 2:
Assumptions
1. 0.008 is the true stdev of defensive talent
2. Defensive talent is independent of park

The only difference from case 1 is that I now use 0.008 for the stdev of defensive talent. Using the same procedures as above, I get a pitcher stdev of 0.007. See table for simulation details:

#BIP Simulation_stdev Real_life_stdev
250 0.0306 0.0321
350 0.0266 0.0269
450 0.0240 0.0245
550 0.0222 0.0221
650 0.0209 0.0210
750 0.0199 0.0204

For Case 2, the pitching/fielding/park breakdown is 0.007/0.008/0.004

P.S. I also calculated a correlation coefficient as described previously (pitchers assigned 200-800 BIPs according to the major league distribution). I get r = 0.19. Still too high (as expected)

Posted 11:03 a.m., August 12, 2003 (#82) - Erik Allen
  Actually, one modification to the above numbers...

For case 2, 0.007 appears to be a bit low for the pitcher distribution estimate. 0.0075 appears a little better, and 0.008 might work also...the estimates are not perfect.

Posted 1:57 p.m., August 12, 2003 (#83) - tangotiger
  Erik, SUPERB stuff! I think as a rule of thumb that on balls in play, fielding/pitching are 50/50, based on your analysis in Case 2.

Now, what if case 1 is more representative? You asked how I got the ".008" as the true expected. In my post #65, I said the following:
If I look at team-level UZR, on a year-by-year basis (n=120 over 4 years), the stdev is about .0100 (but you need to regress somewhat).

Therefore, if you want to rerun to establish what the "true rate" based on this "observed rate", remembering that we've got n=120, perhaps we will find that the true rate standard deviation will be .007 rather than .008 for fielding. My guess is that if you rerun using .007 for fielding, that you'll get .008 for pitching.

In any case, even if you stop here, I think you've added a tremendous amount of knowledge to this.

Our current best-guess is that fielding and pitching are (more or less) equally impactful on BIP.

The revelation about what an "r" really is is also incredibly important to us non-statisticians.

If you were to rewrite your research and analysis, I'd be glad to post it here, or send it to the "home page" of Primer.

Posted 11:04 a.m., August 13, 2003 (#84) - Jonathan
  Sorry for being slow.

If the pitcher and defense's impact on BIP is 50/50, that would make the pitcher's impact on overall run prevention what? 2/3, 3/4? Or is that not something that can be meaningfully calculated (should have stayed awake in statistics class...)

Posted 12:32 p.m., August 13, 2003 (#85) - bob mong
  Great stuff, everyone!

A few minor notes:

Alan Jordan wrote: You said -
The standard deviation is given by
STD = sqrt(n*p*(1-p))

This is wrong. STD =Sqrt(P*(1-P)). N doesn’t come in to play.

Actually, you are both wrong. First of all, that equation for STD is wrong; sample size does matter. To quote from my statistics textbook ("Applied Statistical Methods," Carlson and Thorne, 1997):

Page 49 (in the box): "The sample standard deviation, s, and the population standard deviation, σ, are defined as

s = squareroot(s²) = squareroot( (sum[over i] (x_i - X_avg)²) / (n - 1) )

and

σ = squareroot(σ²) = squareroot ( (sum[over i] (x_i - µ)²) / N)"

Where x_avg is the sample mean, n is the sample size, µ is the population mean, and N is the population size.

This is applicable for any and all distributions; it is the definition of standard deviation. Notice that the sample size is, indeed, a factor.

For specific distributions, there are formulas that you can use to eliminate all that tedious summing, squaring, and square-rooting (is that a word?) -

For example, on page 189, in the box, is given the variance of the binomial distribution (the standard deviation, as I am sure you are all aware, is the square root of the variance):

"and the variance is

σ²_x = (1-π)"

Where "n is the number of independent Bernoulli trials and π is the probability of success for each Bernoulli trial."

Which would imply that the standard deviation, assuming a binomial distribution, is:

squareroot((1-π))

However, Tango was right when he said that the numerator doesn't matter, only the denominator. The sample size matters, not the number of successes.

And furthermore, Tango was also right when he wrote that the closer you are to 0.5, the larger the standard deviation. That follows from the formula:

π, π × (1 - π)
0.1, 0.09
0.2, 0.16
0.3, 0.21
0.4, 0.24
0.5, 0.25
0.6, 0.24
0.7, 0.21
0.8, 0.16
0.9, 0.09

As the probability gets further from 0.5, the standard deviation will become smaller, given identical sample-sizes. That is why the standard deviation of the out is smaller than the STD of 1B, and that is why the STD of XBH is smaller than 1B.
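The table above follows directly from the formula and can be regenerated in a couple of lines:

```python
# Variance of a single Bernoulli trial, pi * (1 - pi), peaks at pi = 0.5
for pi in [i / 10 for i in range(1, 10)]:
    print(pi, round(pi * (1 - pi), 2))
```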

Make sense?

Posted 7:43 p.m., August 13, 2003 (#86) - Chris R
  Erik -

  Is this an accurate synopsis of how you ran the simulations in Case 2 of post #81:

a) For each pitcher, randomly select a park factor. This park factor is normally distributed about 0 with a standard deviation .004.

b) For each pitcher, then randomly select a defense factor. This factor is normally distributed about 0 with a standard deviation of .008.

c) Randomly select a (H-HR)/BIP talent value for each pitcher. These values are normally distributed about the sample mean, and have a standard deviation equal to the stdev value being evaluated.

d) Sum the park factor, defence factor, and talent value for each pitcher, then conduct 250-750 Bernoulli trials with E(X) = park + defence + pitcher for each pitcher.

e) Compare the observed standard deviation of (H-HR)/BIP in the simulation to the observed historical value. If the stdev numbers match up, the talent deviation number is assumed to be close to the actual MLB value.

If I have missed anything here, I'd love to hear about it. I have some ideas about this, but I'll keep my mouth shut until I can be reasonably certain I know what I'm talking about.
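For what it's worth, steps (a)-(e) can be sketched in a few lines. The .29 league mean is an assumption (it isn't stated in the thread); the stdevs are the ones quoted above:

```python
import numpy as np

rng = np.random.default_rng(0)
n_pitchers = 5000
league_mean = 0.29                      # assumed league (H-HR)/BIP
sd_park, sd_defence, sd_talent = 0.004, 0.008, 0.0075

# Steps (a)-(c): draw park, defence, and talent factors for each pitcher
park = rng.normal(0.0, sd_park, n_pitchers)
defence = rng.normal(0.0, sd_defence, n_pitchers)
talent = rng.normal(league_mean, sd_talent, n_pitchers)

# Step (d): 250-750 Bernoulli trials with E(X) = park + defence + talent
bip = rng.integers(250, 751, n_pitchers)
hits = rng.binomial(bip, park + defence + talent)

# Step (e): compare the simulated spread to the historical spread
print(round((hits / bip).std(), 4))
```

Step (e) in practice means sweeping sd_talent until the printed spread matches the observed one for each BIP bin.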

Posted 8:01 p.m., August 13, 2003 (#87) - Chris R
  Oh, and I apologize for not mentioning in my first post that this is great work that has been done so far.

Posted 8:37 p.m., August 13, 2003 (#88) - Chris R
  One thing that has been considered is that parks and defences affect different pitchers differently. I believe you can account for this without complicating the simulation. The defence and park factor standard deviations currently being used seem to be on a team level. The standard deviations for those values (currently .008 and .004) should be calculated at the pitcher level, and as such should be higher.

The park factor deviation could be estimated using the standard deviation of pitcher park factors (Home BABIP - Road BABIP) in the sample set. Unfortunately, I don't have a play by play database built, so I can't determine this number.

The pitcher specific defence deviation is not easy to determine, but we should keep in mind that it will be higher than the overall defence deviation.

One other thing you might consider doing is removing the 45-odd Charlie Hough, Joe Niekro, and Phil Niekro seasons from the data set. It is fairly well accepted that knuckleballers differ from other pitchers on BIP averages, and they are easily identifiable. While they make up a tiny number of the pitchers in the sample, their numbers probably inflate the sample stdev by a significant amount.

Posted 1:26 a.m., August 14, 2003 (#89) - Arvin Hsu
  guys, this is fantastic stuff. I dropped in 2 weeks ago, and have been too busy finishing my Master's thesis to pop back in. Boy was I surprised to see how far you guys have taken this. Outstanding job.

It seems we have a statistical model now:
k ~ Bin(n,p)
p ~ Norm(mu,sigma)
sigma ~ Chi-sqr(tau)
mu = alpha-pitcher + alpha-park + alpha-defense
alpha-pitcher ~ Norm(pitcher-sample-mean,pitcher-variance)
alpha-park ~ Norm(0,park-variance)
alpha-defense ~ Norm(0,defense-variance)

Since in any one year we have a different mix of park/defense/pitcher, we can evaluate the model.

That's awesome, and it can be evaluated. Someone can evaluate it classically if they want. After I get my Master's (Sept), I'll grab Lahman2002 and run it through a Bayesian engine. I should be able to get probability distributions for alpha-park for every ballpark, alpha-defense for every team-year, and alpha-pitcher for every pitcher across seasons.

Posted 1:36 a.m., August 14, 2003 (#90) - Arvin Hsu
  Other notes:

1) I did this a couple years ago and posted to r.s.b, I think. You shouldn't need to do year-to-year correlations. You should get _much_ better estimates of pitcher ability if you aggregate seasons. Keith Law came up with the original idea, iirc. So, here's how you do it: correlate pitcher A's odd-season $H with pitcher A's even-season $H. You can use Lahman to exclude all pitchers with, ooh.. <800 IP, and exclude all seasons with <50 IP. That should give you a few hundred data points that may turn up much stronger correlations than you've been getting.
Actually, now that I think about it, you don't need to exclude <50IP seasons, you just aggregate the seasons before calculating $H.
I'll do this in Sept, too, if no one gets around to it earlier. I predict a much higher r^2 than you've been getting.

2) binomial population variance vs. observable sample variance.
I need to think about this. Chris, this may be where you had been planning to go. I figure we should be able to calculate, rather than simulate what our sample variance _should_ be. This would be an "exact" formula, in the words of Carl Morris.
I'll think about this for a few days, and post back if I've got it.

-Arvin
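Point 1 can be illustrated with synthetic pitchers (every number here is a placeholder, not an estimate): aggregating odd vs. even seasons before correlating cuts the binomial noise and raises r.

```python
import numpy as np

rng = np.random.default_rng(1)
n_pitchers, n_seasons, bip = 300, 10, 500
true_bh = rng.normal(0.29, 0.008, n_pitchers)   # placeholder $H talent spread

# One observed $H per pitcher-season
obs = rng.binomial(bip, true_bh[:, None], size=(n_pitchers, n_seasons)) / bip

# Single-season year-to-year r vs. odd/even aggregate r
r_single = np.corrcoef(obs[:, 0], obs[:, 1])[0, 1]
r_agg = np.corrcoef(obs[:, 0::2].mean(axis=1), obs[:, 1::2].mean(axis=1))[0, 1]
print(round(r_single, 2), round(r_agg, 2))
```

With five seasons on each side of the split, the aggregate correlation comes out well above the single-season one.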

Posted 1:50 a.m., August 14, 2003 (#91) - Chris R
  I figure we should be able to calculate, rather than simulate what our sample variance _should_ be.

  I had considered this, but I don't think it is a problem. I certainly think it would be interesting to calculate, rather than simulate, what the pitcher variance should be, but if the simulation is done correctly, the answers should be very similar. Eliminating the simulation variance would help, but the dependency between park and defence factors, and the lack of a good estimate of defence variance relative to pitchers, are larger issues.

Posted 2:10 a.m., August 14, 2003 (#92) - Arvin Hsu
  one other clarification point Re: binomial distribution

X ~ Bin(n,p) is the distribution for Random Variable X.
X is the expected number of events to occur with prior prob. p.

When ppl quote std-Bin = sqrt(n*p*(1-p)), this is the std for r.v. X.
It is _not_ the std-dev for p-hat = X/n, or what ppl use as the best-estimator for the population parameter p.
p-hat is a proportion.
X is an integer.
The variance for p-hat is calculated as follows:
Var(aX) = a^2*Var(X)
p-hat = X/n
Var(p-hat) = Var(X/n) = Var(X)/n^2
Since Var(X) = std(X)^2 = n*p*(1-p),
Var(p-hat) = n*p*(1-p)/n^2 = p*(1-p)/n
std(p-hat) = sqrt(p*(1-p)/n)

Note that this is the SAMPLE variance and standard deviation.
p-hat is the best estimator for the population parameter p.
std(p-hat) is NOT an estimator for the population variance of p.

What does this mean for us? It means as N increases, the sample std-dev of the sample will decrease. This explains, perfectly, Erik's findings in post #58:
BIP #seasons std(p-hat)
200-299 1446 0.032
300-399 812 0.0268
400-499 592 0.0245
500-599 507 0.0221

to take two numbers:
n=250, std(p-hat) = .032
n=500, std(p-hat) = std(p-hat(n=250))/sqrt(2) = .032/1.41 = .0226
Your std(p-hat) for n=550 is .0221!!!
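That projection takes two lines to verify (the .29 here is an assumed league $H; only the ratio between sample sizes matters for the scaling):

```python
import math

p = 0.29  # assumed league $H; cancels out of the ratio below
# std(p-hat) = sqrt(p*(1-p)/n), so doubling n shrinks it by 1/sqrt(2)
ratio = math.sqrt(p * (1 - p) / 500) / math.sqrt(p * (1 - p) / 250)
print(round(ratio, 3))                  # 1/sqrt(2)
print(round(0.032 / math.sqrt(2), 4))  # projecting the observed n=250 stdev
```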

The question you really want to ask is: what is the relationship between p-hat, or std(p-hat), and my POPULATION variance for p? That question is a bit more difficult, and is mentioned in my previous post. A start, at least:
pop-variance should be distributed Chi-Sqr.
One next step would be to calculate the MLE(Maximum Likelihood Estimator) for pop-variance.
There are other estimators, as well, though MLE should be fine.
It should be calculated already, if anyone wants to look it up in a stats textbook.

That estimator only begins to answer the questions posed, since we need to then combine with a model for multiple values of n, in order to find an estimator for the population variance.

-Arvin

Posted 2:22 a.m., August 14, 2003 (#93) - Chris R
  I've mentioned defence variance relative to pitcher a couple of times now, and I am not sure I have explained clearly what I mean by this.

Each of the data points used to determine the defence stdev number of .008 used in the most recent simulation represents the performance of a single team's defence over 162 games. However, it is being used to estimate the performance of a defence behind a single pitcher over the course of a season. Even if every defence treated all pitchers equally, the defence-pitcher variance would be larger than the defence-team variance because of the smaller sample sizes involved. Combined with the fact that we can reasonably assume that defences do not treat all pitchers equally, the defence-pitcher variance should be even larger yet.

I don't see a simple way to estimate defence-pitcher variance, but I have an idea for moving closer to it. Erik's most recent estimation of .007 for pitcher-talent stdev should represent an upper bound for the actual value of pitcher-talent stdev. With that, you could calculate an observed BABIP variance for teammates, then turn the simulation around and calculate a lower bound for intra-team defensive variance. That number could then be combined with the current estimate for inter-team defensive variance to produce a new lower bound for pitcher-team variance. Repeat the process a couple more times, and you might have a better simulation.

Of course, there might be a closed form solution for determining these values, but I won't hold my breath.

Posted 2:26 a.m., August 14, 2003 (#94) - Arvin Hsu
  damnit, stop drawing me away from matlab.

Chris:
I had considered this, but I don't think it is a problem. I certainly think it would be interesting to calculate, rather than simulate, what the pitcher variance should be, but if the simulation is done correctly, the answers should be very similar.
True. In fact, as I mentioned, it's difficult to calculate it with the differing N's, and the simulation seems to work well enough. The two reasons to do it are: 1) interesting, as you said, 2) it would give us a better idea on what factors affect the observed std(p-hat).

Eliminating the simulation variance would help, but the dependency between park and defence factors, and the lack of a good estimate of defence variance relative to pitchers are larger issues.
Totally agree. But they can be unlinked. The big problem is ballpark v. defense. I think the key here is to take multiple seasons, and hold ballpark constant, but defense as adjustable.
But, you don't have enough degrees of freedom. Hrmm...

Each pitcher on a team shares the same defense each year. They also share the same home ballpark each year. How do you disentangle home ballpark from defense without using pbp data? Home/away splits for each pitcher would do it, but that's still unavailable/not easy to crunch. You can't use offense numbers to calculate alpha-park; there's a dynasty bias.
10 seasons * 5 starting pitchers = 50 data points
1 park factor, 10 defense factors, 5 pitcher factors = 16 variables
plenty of room to get good estimates.

but once you collapse pitchers on a team, you get
10 seasons = 10 data points
1 pf, 10 defense factors = 11 variables.
and that's not identifiable.

grrrr... any suggestions?

Posted 2:35 a.m., August 14, 2003 (#95) - Arvin Hsu
  Chris,

could you define your proposed variables a little clearer?
So far, I understand:

defence-team variance (.008)
pitcher variance (est. at .007)

Everything else in your post, I'm a bit fuzzy on:
What is defense-pitcher variance?
Is defense-team = inter-team variance,
is defense-pitcher = intra-team variance?

-Arvin

Posted 2:47 a.m., August 14, 2003 (#96) - washerdreyer
  I cannot say I understand everything going on in this thread, but even what I have been able to glean seems really outstanding. With this, the music thread, the civil war thread, and the California recall thread running, this is just a fascinating time to be on Primer.

Posted 2:54 a.m., August 14, 2003 (#97) - Chris R
  For the sake of continuity, I'll start spelling defense like everyone else.

Defense-team variance (.008) is what you have called alpha-defense (It sure would be nice to have a greek keyboard right now).

Defense-team = inter-team variance,
Defense-pitcher = f(Var(inter-team),Var(intra-team))

Unfortunately, I'm not sure what f is.

Var(intra-team) = sum over i (pitcher-i $H - team $H)^2 /(n-1)

Basically, a pitcher's results will differ from his teammates results, and his defense will differ from other team's defenses. These two degrees of separation from the average pitcher could be estimated by a single random variable.

BTW, I envy your access to matlab. I'm stuck here contemplating writing simulations in java.

Posted 5:46 a.m., August 14, 2003 (#98) - Sylvain(e-mail)
  Excellent, tremendous stuff, congrats to every participant.
Should automatically deserve the primey for best thread.

Sylvain

Posted 7:18 a.m., August 14, 2003 (#99) - Tangotiger
  To split GB/FB: I mentioned earlier that we can probably use a mean of .50, with a standard deviation of .04. I'll confirm that later.

I will redo the UZR observed calcs, splitting between IF and OF (to approximate GB/FB). I'll guess that we'll get a std dev of .015 observed for each.

For the park, for the IF/GB, I have to believe that the effect is almost all grass/turf. From that standpoint, you would do something like +.002 grass, and -.002 turf or something. If someone wants to look at the DER factors I put up, you can probably make a good guess at that. Maybe.

So, for the OF/FB factors, you'll probably get .006 or .007 for the standard dev.

So, as the next baby step, in addition to the steps Erik has taken, you randomly assign a pitcher a grass or turf park (based on the 1972-1992 teams), and you randomly assign him a GB/FB tendency, and you randomly assign him an OF/FB park factor.

Posted 7:19 a.m., August 14, 2003 (#100) - Tangotiger
  And this is where Erik now has to split things in 2: based on what his GB/FB rate is, you'd have to give, say, your first pitcher 2000 GB and 2550 FB, and then use the appropriate park and fielding factors for each of GB and FB.

Posted 10:51 a.m., August 14, 2003 (#101) - Arvin Hsu
  Tango:

It seems like you're introducing a lot of variables. Are you planning to use pbp data to identify all the values?
It seems like you want to say each pitcher is a combination of
the following over 21 years(72-92):
one of 28+ park factors
one of 28*21 IF defense factors
one of 28*21 OF defense factors
his own pitcher IF factor
his own pitcher OF factor

Is this correct?

You're using up df very fast this way, and may run into identifiability problems. Park factors may also have an IF/OF split.

-Arvin

Posted 10:55 a.m., August 14, 2003 (#102) - Arvin Hsu
  yah, it looks like 28*2 park factors.
adding it up:
assume 500 pitchers, avg. of 5 years per pitcher
data points:
5*500 = 2500

variables:
28*2 = 56
28*21*2 = 1176
500*2 = 1000
total: 2232

Well, 2232<2500 which makes it theoretically identifiable. But
this is immense.

-Arvin

Posted 10:56 a.m., August 14, 2003 (#103) - tangotiger
  Among the 752 pitchers with at least 1000 PA (average of 3,589), the standard deviation to their GB rates was .078.

Among the 183 pitchers with at least 5000 PA (average of 8,106), the standard deviation to their GB rates was .066.

Among the 36 pitchers with at least 10,000 PA (average of 12,705), the standard deviation to their GB rates was .058.

My guess is that if you were to run your "expected" to "observed" sim using these numbers, that you would get the "expected" stdev of .04.

Posted 11:14 a.m., August 14, 2003 (#104) - tangotiger
  The standard deviation, observed, over 120 team-years:

whole team: .008
IF only: .010
OF only: .013

The rest of this take with a grain of salt, since I had to make some assumptions. Anyway, by position, over 120 team-years, here are the stdev:
2b/ss: .015
3b/lf/cf/rf: .023
1b: .030

Now, what can you do with this information, besides what we've talked about? Well, you can FINALLY answer the question: is fielding talent at a position independent on a team level? That is, do teams seeing that they have a bad SS counter that with a great 2B? Or, are the talents at the positions randomly distributed?

Well, once Erik or someone confirms what the "expected" stdev is based on these true rates at a position level, you can then see if using these values as independent variables will match the observed at either the IF/OF level or at the team level.

My guess is that teams DO treat positions rather independently.

Should be fun to find out...

Posted 11:31 a.m., August 14, 2003 (#105) - tangotiger
  Actually, Arvin, I'm assuming "league average" for everything else. For example, I already published the DER park factors over the 21-year span. The standard deviation (50% home, 50% road) was .004. I'm assuming that over that many years and BIP, the observed and expected would come in at pretty much the same thing.

A pitcher takes a random point inside this DER park factor, and when applied with the pitcher's expected DER rate, and his sample (say BIP=600), this will match the observed DER rates (which I sent to Erik).

So, now we're extending that. We're saying that a pitcher will have a random GB rate which we're taking from the stdev observed of .06, which we have the "true" rate as probably .04. We split up his BIP into 2. Instead of the park factor DER, we use the park factor IF or OF DER. Since the observed at the IF/OF level is around .011 or so, the expected might be .006.

etc, etc...

Posted 10:42 p.m., August 14, 2003 (#106) - Erik Allen
  Tango,

Thanks for all the data. I will try and run some simulations, and maybe have results this weekend or Monday.

I was wondering (if this is easy to compute) what r value you calculate for year-to-year correlation of pitchers that did not change teams? Is that information available, or too tough to get at?

Posted 8:39 a.m., August 15, 2003 (#107) - Erik Allen
  Tango, questions from post #104:

1. The standard deviations you are providing here are for UZR, correct?
2. In post 65, you list the overall team standard deviation at 0.010, whereas here you list it at 0.008. Did you get different results the second time, or did you write 0.008 because this was already the value agreed upon?
3. How many opportunities do teams typically get for ground balls and flyballs? Overall, the total BIP is around 4550, but what is the IF/OF or GB/FB breakdown? Ideally, a distribution would be best (i.e. 10 teams had 2000-2100 FB, 5 teams had 2100-2200 FB, etc.) but just average numbers would be okay for a first pass.
4. Same question as number three except broken down by position.

Posted 8:47 a.m., August 15, 2003 (#108) - Erik Allen
  To Arvin and Chris R:

Sorry for the delayed response, but I am not a statistician, so it took me a while to understand what you are saying. :)

In response to Chris R, post 86: Your summary of my methodology is essentially correct.

In response to Arvin, post 92: Not entirely sure I understand what you mean, but essentially, I think you are saying that the observed variance is not the same as the population variance. I agree with this totally, and it is one of the main things I am taking into account in my simulation. However, if you have a way to calculate it analytically, all the better.

To Chris R, post 93: I agree that defenses affect pitchers differently, and this needs to be the next step. This ultimately means I cannot select a defense and a pitcher ability randomly. However, I disagree that the true variance of the defense depends on the number of opportunities. The true (population) variance is independent of the number of BIP, but the observed variance will depend on sample size.

Thank you both for your insightful comments, and once again, sorry that I did not respond sooner.

Posted 8:51 a.m., August 15, 2003 (#109) - Erik Allen
  Tango, one more question:

5. Are the standard deviations you provide already park-adjusted? This probably won't make a significant difference, but just asking.

Posted 9:50 a.m., August 15, 2003 (#110) - tangotiger (homepage)
  Excellent ball distribution data can be found at the above. Essentially, this is what I used, even though it doesn't exactly correspond with the same time period.

I did not park adjust any of the figures I supplied. I will rerun my standard deviations by pos to make sure I've done it correctly.

Yes, I use UZR for everything.

The .010 was the observed standard dev for n=120, and .008 was the "agreed expected true", which you can confirm using your process.

Posted 10:41 a.m., August 15, 2003 (#111) - tangotiger
  Ok, reworking my numbers to match the Levitt numbers, this is what I get for standard deviations. n=120.

Both: .009
IF: .013
OF: .013

rf: .020
2b: .020
ss: .021
lf: .022
cf: .026
3b: .031
1b: .032

(Doing a weighted average of the above, and we get a value of .024. I think for ease, we should consider the standard deviation on a per-position basis to be the same and equal to .024. Erik, it's your time, so do whatever you figure you can handle.)
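As a quick sanity check on independence (assuming IF and OF each see roughly half the BIP, which is my assumption, not a figure from the thread): if the IF and OF deviations are independent, the team-level stdev should be the square root of the sum of the weighted variances.

```python
import math

sd_if, sd_of = 0.013, 0.013
# Team DER deviation as a 50/50 mix of independent IF and OF deviations
sd_team = math.sqrt((0.5 * sd_if) ** 2 + (0.5 * sd_of) ** 2)
print(round(sd_team, 4))
```

That lands at .0092, close to the .009 observed for the whole team, which is at least consistent with treating IF and OF as independent.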

These standard deviations are all observed and need to be sim-ed or calculated to determine the "true rates".

There's something that looks strange with the Levitt numbers. For example, the BABIP against SS comes out to .066, while against CF it's .454.

I do know that the BABIP for GB and FB are more or less similar (about .030 off, with the value lower against OF). But the Levitt numbers show a BABIP of under .100 for IF and over .500 for OF. Do some balls hit into the OF count as GB? Is this what I'm missing?

Posted 11:15 a.m., August 15, 2003 (#112) - tangotiger
  When I think about it, each of those positions need to be regressed a different amount, since the opps for each position is different. So, first, we'd have to do the sim process to get to the true rates for each position. THEN, we can do a weighted average if we want a uniform true rate to use.

Posted 11:40 a.m., August 15, 2003 (#113) - Mike Emeigh(e-mail)
  I do know that the BABIP for GB and FB are more or less similar (about .030 off, with the value lower against OF). But the Levitt numbers show a BABIP of under .100 for IF and over .500 for OF. Do some balls hit into the OF count as GB? Is this what I'm missing?

Yes.

The Levitt numbers are based on the identity of the player who initially fielded the ball. A ground ball through the SS hole will be initially fielded by the left fielder, so in the Levitt study that ball will count in the LF's totals.

-- MWE

Posted 3:36 p.m., August 15, 2003 (#114) - Arvin Hsu
  1) Regarding UZR: you guys are way over my head. I haven't looked at the methodology close enough to begin to understand it.

2) tango: Actually, Arvin, I'm assuming "league average" for everything else. For example, I already published the DER park factors over the 21 year span. The standard deviation (50% home, 50% road) was .004. I'm assuming that over that many years and BIP that the observed and expected would come in at pretty much the same thing.

Why would the STD of both stats be the same? The .004 std for DER should be stat dependent, and shouldn't have anything to do with the population std of the defenses using this binomial p model that Erik is using.

-Arvin

Posted 3:58 p.m., August 15, 2003 (#115) - tangotiger
  Arvin,

I don't know what you are talking about! Please re-explain.

What I am saying is that for the *park factors* DER, those splits were based on 21 years of data, comprising about 80,000 BIP per team. So, if I say that Fenways is +.020 hits / BIP compared to a non-Fenway park, my guess is that this observed difference will be pretty darn close to whatever "true" difference would produce this observed difference over 80,000 BIP. Are we talking about the same thing here?

As for UZR, just think about ZR or DER instead. We are simply talking about how many extra outs a fielder makes / BIP. The standard deviation, on the observed team-level data (n=120) is .010. Broken down by position, the observed standard deviation (n=840) is .024.

If you regress a certain amount, or calculate the "True" rate using this sim process, my guess is that the true standard deviation that produces those observed figures would be .008 for team fielding and .015 for positional fielding.

Mike: interesting. Can you provide the "plays,outs,hits in zone" (however you want to define it), split by GB/FB, by position for any year that you have it?

Posted 6:43 p.m., August 16, 2003 (#116) - Arvin Hsu
  Tango...

thanks for the explanations. The methodology makes sense now.

As for formulas... I think the numbers you've been simulating, Erik, can be approximated by assuming that the variance's add.
IOW,
k ~ Bin(n,p)
p ~ Norm(0,sigma^2)
Var(k) = n*p*(1-p)
Var(p-hat) = p*(1-p)/n <---- this is what we expect the Binomial to contribute to our data.
Observed Variance of Data = Var(p) + Var(p-hat)
= sigma^2 + p*(1-p)/n

So... Using your data, predicted std = sqrt(p*(1-p)/n + .012^2), with n the midpoint BIP of each group:

BIP range / count / observed std / predicted std
200-299 / 1446 / .0320 / .0309
300-399 / 812 / .0268 / .0269
400-499 / 592 / .0245 / .0244
500-599 / 507 / .0221 / .0226
600-699 / 579 / .0210 / .0213
700-799 / 454 / .0204 / .0203

------------------

Also, if you have multiple sources of variance, they will also
add similarly:

Total True Variance = True Defense Variance+TrueParkVariance+TruePitcherVariance

.012^2 ~= .0075^2 + .008^2 + .004^2 = .0117^2

I'm not entirely comfortable with the simplification that the Norm Variance and the Bin Variance will linearly add, but it appears to fit the data well.

-Arvin

Using your data: sqrt(.0075^2+.008^2+.004^2) = .012
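Arvin's additivity claim can be checked with a quick Monte Carlo sketch (not from the thread itself; the league rate .281 and true std .012 are the numbers used above): draw each pitcher's true hit rate from a normal talent distribution, simulate his BIP binomially, and compare the observed spread to sqrt(p*(1-p)/n + sigma^2).

```python
import math
import random

random.seed(1)

p_bar, true_sd = 0.281, 0.012   # league hit rate on BIP, true talent std (from the thread)
n, pitchers = 700, 5000         # BIP per pitcher, number of simulated pitchers

rates = []
for _ in range(pitchers):
    p = random.gauss(p_bar, true_sd)                   # this pitcher's true $H talent
    hits = sum(random.random() < p for _ in range(n))  # one binomial season
    rates.append(hits / n)

mean = sum(rates) / pitchers
obs_sd = math.sqrt(sum((r - mean) ** 2 for r in rates) / pitchers)

# Arvin's claim: observed^2 = binomial^2 + true^2
predicted = math.sqrt(p_bar * (1 - p_bar) / n + true_sd ** 2)
print(round(obs_sd, 4), round(predicted, 4))
```

The two printed numbers come out essentially equal, which is the same agreement the chart above shows for Erik's real data.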

Posted 6:44 p.m., August 16, 2003 (#117) - Arvin Hsu
  btw, Tango...

in your most recent post you said UZR team std = .010,
whereas in Erik in his prior calculations was using .008.
Which one is correct?

-Arvin

Posted 8:26 p.m., August 16, 2003 (#118) - Tangotiger
  .010 is the observed stdev, which we simed (or mentally regressed) to .008.

Arvin, is that a statistical equation? Because it is brilliant and simple! Pythag move over, make way for the Arvin theorem.

Posted 10:49 p.m., August 16, 2003 (#119) - Tangotiger
  Arvin's theorem is intriguing. For example, I mentioned that the observed stdev for IF and OF was .013, and for the team it's .009 (according to my post 111).

Let's see what happens with this new equation, and realizing that half the BIP are IF and half are OF (let's say).

Observed team ^ 2 = [(.013/2) ^ 2] + [(.013/2)^2] = .009 ^ 2

Wow!

How about if we use the .024 for each of the 7 positions? Following the same process, and we get: .009!

Holy moley!

Now, if you want to really impress me, tell me how to get from the observed stdev to the true stdev. That is, how much do I regress towards the mean, given the sample size? Do I make it k/sqrt(n)? How do I know what to set k to?

Posted 10:53 p.m., August 17, 2003 (#120) - Arvin Hsu
  Tango: Like I showed above:

k ~ Bin(n,p)
p ~ Norm(0,sigma^2)
Var(k) = n*p*(1-p)
Var(p-hat) = p*(1-p)/n <---- this is what we expect the Binomial to contribute to our data.
Observed Variance of Data = Var(p) + Var(p-hat)
= sigma^2 + p*(1-p)/n

Hrmm... now that I re-paste it, I can see that it's a bit impenetrable. How's this?

Observed Variance = Binomial Variance + True Variance

So...

(Obs. Std)^2 = p*(1-p)/n + (true Std)^2

And this is what the chart that I posted after that showed for Erik's numbers. Oh, p is the avg. rate for the binomial: eg. .281, but it would be different if the team fielded at, say, .300, or whatever...
And n is obviously the number of BIP.

-Arvin

Posted 11:17 p.m., August 17, 2003 (#121) - Tangotiger
  Ah, got it now. Tremendous stuff.

So, to go back first to my team-level data. I showed an observed stdev of .009 for my 120 teams, each of which has about 4500 BIP. In your equation above, is n=120, or n=120x4500 or n=4500? If n=120, then how do you account for a team having 4500 BIP or 62 BIP?

Posted 1:46 a.m., August 18, 2003 (#122) - Arvin Hsu
  Tango:
n=4500 in your example. Your data has 120 observations of a Binomial distribution where n=4500. The std-dev observed is the std-dev calculated on the binomial distribution, and is an estimate of the combination of the binomial variance and the underlying (true) variance.

-Arvin

Posted 9:45 a.m., August 18, 2003 (#123) - tangotiger
  Good stuff again.

So, we have

.090^2 = .28*.72/4500 + true^2

that makes the true std dev at the team level as: .090

Actually, even after only 450 BIP, the true stdev rate comes in at .087.

Am I doing this right?

Posted 9:49 a.m., August 18, 2003 (#124) - tangotiger
  Oops... that should be .009. Working it out again, and we get: .006

So, that's the fielding.

.012 ^2 = .006^2 + .004^2 + pitching^2

pitching = .010

So, are we saying that each pitcher has a .010 stdev, each team of fielders is .006?
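The two back-outs in this post can be written as one small helper (a sketch, using only the thread's numbers: observed team std .009 over ~4500 BIP, league rate ~.28, park .004, combined true .012):

```python
import math

def true_sd(observed_sd, p, n):
    # strip the binomial noise p*(1-p)/n out of an observed std dev
    return math.sqrt(observed_sd ** 2 - p * (1 - p) / n)

# true team fielding: observed .009 over ~4500 BIP at a ~.28 league rate
fielding = true_sd(0.009, 0.28, 4500)

# then .012^2 = fielding^2 + park^2 + pitching^2, with park = .004
pitching = math.sqrt(0.012 ** 2 - fielding ** 2 - 0.004 ** 2)

print(round(fielding, 3), round(pitching, 3))
```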

Posted 10:05 a.m., August 18, 2003 (#125) - tangotiger
  Continuing in the same vein:

true team fielding ^2 = true 1b fielding ^ 2 + true 2b fielding ^ 2...

.006 ^ 2 = [(t/7)^2] * 7

(That is, each position is on average getting 1/7th of the plays, and there are 7 positions. See post 115 for more info.)

t = .016 = true avg single fielding position

So...... the true standard deviation for a single position is about .016. The true standard deviation for pitchers is .010. So, on any given BIP, the fielder has more influence than the pitcher.

On a group of BIP, the pitcher has more influence than the team of fielders.

Anyway, since we know that range of fielders UZR runs is about +/- 30 runs (and since we know that their stdev is .016), then I would make a guess that pitchers with a stdev of .010 would have a range of +/- 20 runs. That is probably our best guess as to the influence of the pitcher on BIP.

Just taking a wild guess, but if the range is +/- 20 runs, then 1 stdev is probably +/- 6 runs. So, we expect say 95% of pitchers to be +/- 12 runs.

Since our best interpretation of BABIP shows that a pitcher's skill is about +/- 8 runs, for 95% of them, then the BABIP is not a good enough metric to capture the real skill that a pitcher has on the influence of BIP.

Posted 10:14 a.m., August 18, 2003 (#126) - tangotiger
  Sorry for the continuous posts, but I'm writing faster than I'm thinking. That last step you should ignore, as it uses different denominators.

Anyway, since we've established for fielders that 1 stdev is .016, and if they average about 650 plays each, that gives us 1 stdev = 10 plays per season, or about 8 runs. That's 1 stdev for fielding runs for an average fielding position.

For pitchers, 1 stdev is .010. The average full-time starter will have 700 BIP, and the average reliever will have 200 BIP. For the starter, .010 stdev on 700 BIP = 7 plays per season, or about 6 runs. That's 1 stdev for pitching runs for an average full-time starter. That means 95% of pitchers have a skill at preventing hits on BIP to the tune of 12 runs per 700 BIP.

I believe that our current interpretation of BABIP is that 95% of pitchers have a skill to the tune of 8 runs per 700 BIP, but I'll have to look that up again.

Bottom line? Pitchers have the skill, not as much as an individual fielder (60/40 split on a single BIP), but they have more skill than a team of fielders (60/40 split the other way for a season of BIP). And the BABIP metric is not good enough to capture this skill.

Which is why we need PZR to find their skill.

Tremendous work by Erik and Arvin!!
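The plays-to-runs arithmetic in post 126 can be sketched in a few lines (the ~0.8 runs per marginal play is an assumption implied by "10 plays ... about 8 runs"; it isn't stated anywhere else in the thread):

```python
runs_per_play = 0.8          # assumed run value of turning one hit into an out

fielder_plays = 0.016 * 650  # 1 stdev for a regular fielder, ~10 plays
pitcher_plays = 0.010 * 700  # 1 stdev for a full-time starter, 7 plays

print(round(fielder_plays * runs_per_play), round(pitcher_plays * runs_per_play))
```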

Posted 1:50 p.m., August 18, 2003 (#127) - tangotiger
  Erik, Arvin, and anyone else who has contributed to this thread: I was thinking of doing a writeup of this entire thread as an article, hopefully citing everyone's work at the appropriate places. This has really been an eye-opener for me, and perhaps having a (hopefully logically ordered) detailed summary of the really incredible work by Erik and Arvin would be the reference point for DIPS going forward. (Arvin, I've already got Erik's email, so please send me your email address.) I don't know about anyone else, but I call this a sabermetric orgasm!

Posted 3:21 p.m., August 18, 2003 (#128) - David Smyth
  That's a great idea, Tango. If it were written up, it would be easier for everyone to understand the material, including myself.

Posted 6:55 p.m., August 18, 2003 (#129) - Arvin Hsu
  tango: that sounds like a good idea.

I have a few thoughts on the theory that could be included. I'll post later about it.

-Arvin

Posted 11:01 p.m., August 18, 2003 (#130) - Arvin Hsu
  tango:

wow... your thoughts move faster than mine:

.006 ^ 2 = [(t/7)^2] * 7

(That is, each position is on average getting 1/7th of the plays, and there are 7 positions. See post 115 for more info.)

t = .016 = true avg single fielding position

I read 115, and I still don't understand what you're doing here...
isn't it: .006^2 = [t^2]*7??
That means that on any one play, a fielder gives .0023?

Here's how I interpret it:
.006 is the std(variance) of defensive ability, independent of park or pitchers. If every fielder contributes .0023 std (variance), then that creates a teamwide std (variance) of .006.

Alright, but on any one play, the ball can only be fielded by one fielder, sometimes two, very very rarely three. So what does that mean?

.006^2 is the variance of the underlying distribution that governs the binomial p... On one play, or one Bernoulli trial, you have a p that is determined by 1) pitcher's alpha, 2) park's alpha, and 3) defense-alpha.
3) defense-alpha ~ Norm(0,.006^2)
Thus, about 68% of teams have a defense-alpha within +-.006.

What about on a play? That entire variance should be attributed to the fielder, I think, or divided in half(using Pythag) and given to both fielders, or all 3 or however many are involved in the play.

That means that if you have 7 fielders, all at +.006, then you would have teamwide defense at ... +.006. Because on any one play, the ball only goes to one of the fielders.

-Arvin

Posted 7:08 a.m., August 19, 2003 (#131) - Tangotiger
  I think you're right, but I've got to think about it for it to sink in. Makes sense though...

Posted 10:32 a.m., August 19, 2003 (#132) - Arvin Hsu
  I think it comes down to why we're adding variances:
In a normal distribution, you add when you add two random variables.
eg. X ~ Norm(3,.05^2), Y ~ Norm(4,.03^2)
Z = X+Y
Z ~ Norm(7, .05^2+.03^2)

Although I'm not yet convinced it works this way for a normal p affecting a binomial, the sims seem to bear it out. Why? Well, it makes sense. On some level, you're adding two r.v.'s, the binomial and the normal. So, on any one play, who can affect the ball? Only our 1-3 fielders. Thus, the fielders' variances can never really _add_, since they each affect _different_ plays. We add park + pitcher + defense specifically because all 3 affect the _same_ play.

-Arvin

Posted 10:34 a.m., August 19, 2003 (#133) - Arvin Hsu
  In fact, what you probably have is a single fielder std somewhere around .0059, and it averages out to .0060 once you add in plays where two fielders can make the play, as well as three. If you have these numbers we can try to calculate it, but I figure .006 is strong enough an approximation for now.

-Arvin

Posted 11:10 a.m., August 19, 2003 (#134) - tangotiger
  Yes, I think I am agreeing with you. Since our assumption is based on fielding talents on single fielders only, I think we can stick with .006 (though again, I don't see this being impacted to a significant degree if we look at SS to 1B throws, or 2B to SS DPs, etc).

So, what we are saying is that we have a 10/6/4 split between pitching/fielding/park, in that order. Luck plays a part, and that is dependent on the sample size. When n=1, it's almost all luck. When n=1 million, luck is not involved.

So, over 700 BIP, where we observed a .020, we have the following:

observed ^ 2 = .010 ^ 2 + .006 ^ 2 + .004 ^2 + luck ^2 = .020 ^ 2
solving for luck = .016

So, can we say that when a starter has 700 BIP, the influence on those BIP as a group can be broken down by:
luck : 44%
pitch: 28%
field: 17%
park : 11%
??
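That split can be reproduced as shares of the summed standard deviations (a sketch using the thread's numbers; note the shares are of std devs, not of variances):

```python
import math

pitch, field, park = 0.010, 0.006, 0.004
observed = 0.020   # observed std for starters at ~700 BIP
luck = math.sqrt(observed ** 2 - pitch ** 2 - field ** 2 - park ** 2)

parts = {"luck": luck, "pitch": pitch, "field": field, "park": park}
total = sum(parts.values())
for name, sd in parts.items():
    print(name, round(100 * sd / total))   # luck 44, pitch 28, field 17, park 11
```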

I have to admit that I've recently said, though I don't remember where, that I thought the split would be 40/30/20/10 with the order being luck,fielding,pitching,park.

What we are saying here is that pitching and not fielding is the larger determinant between the two. And perhaps before I read about DIPS I might have had the correct order.

I think it's still important that, yes, we need to separate the components (HR, BB, K) from the BIP, as Voros does. But the conclusions drawn from that do not stand based on the reasoning.

I think our best conclusions would be as follows:
1 - pitching has more impact on BIP than does fielding
2 - luck has more impact than anything, over 700 BIP
3 - BABIP is not a good enough measure for the pitcher's skill

What would be interesting is if MGL or Tippett or someone with pbp data gets around to implementing the PZR blueprint I published (the flip side to UZR); then we'll get closure on this subject. That is, we should be able to get the standard deviations on the pitcher's side that will support the data we are inferring here.

So, before we trample in any direction, it may be worthwhile to keep the case open, pending final data. After all, we may have made a serious miscalc somewhere.

Posted 11:46 a.m., August 19, 2003 (#135) - tangotiger
  Someone asked me about the implication of all past DIPS work. I responded the following:

=====================
I'm not really sure of the impact. It's still a blur
to me as to what use to make of it.

What we are saying is that there are 2 components for
a pitcher: his non-fielding-dependent skill (HR, BB, SO)
and his fielding-dependent skill.

We know very well how to estimate the former, and not
very well the latter. Since the BABIP figure is not
reliable for an individual pitcher, it's more accurate
to use say 50% lg, 40% team, 10% pitcher to estimate
his expected BABIP. But, that estimate will come with
a very wide margin for error.

The conclusion stands that you need to separate
things, and you can't rely on a pitcher's past BABIP
to predict the future (much like you wouldn't use his
ERA). Still outstanding is WHAT to use for BABIP.
I'll contend that PZR would be that measure. But,
that has yet to be implemented by anyone.
=====================

Posted 1:59 p.m., August 19, 2003 (#136) - Arvin Hsu
  observed ^ 2 = .010 ^ 2 + .006 ^ 2 + .004 ^2 + luck ^2 = .020 ^ 2
solving for luck = .016

Or... you could just say:
Binomial distribution, n=700, p = .281
std: sqrt(p*(1-p)/n) = .0169

the extra .0009 is probably rounding error, since we only have one significant figure on a lot of these numbers.

-Arvin

Posted 2:47 p.m., August 19, 2003 (#137) - tangotiger
  Actually, the observed should have been
600-699 579 0.0210
700-799 454 0.0204

So, at 700 BIP, I should have used .0207. Reworking, and we get a nearly perfect match.

Posted 4:31 p.m., August 19, 2003 (#138) - Arvin Hsu
  so... back to figuring out the impact of a single player.
Anyway, since we've established for fielders that 1 stdev is .016, and if they average about 650 plays each, that gives us 1 stdev = 10 plays per season, or about 8 runs. That's 1 stdev for fielding runs for an average fielding position.
If we redo this approximation, we get:

(.006)*650 = 3.9 balls/season.

Most ppl would agree that the number seems low. About 68% of starting fielders are within +-4 plays/season from average? It definitely seems low. So how to account for it?

Well, let's start by pointing out the weaknesses in the approximation:
a) BIP are not distributed evenly. SS's obviously get more BIP than RF'ers, or 1B-men.
b) +-.006 is 1 STD for the team defense. This means that individual positions may have a higher STD.

Let's also point out two general items which will bias judgment:
a) 1 STD is just that: one standard deviation. The best players may be 2 or 3 or even 4 standard deviations from average (Erstad, anyone?). At this point, we have no reason to believe that the underlying skill distribution is normal. It may, in fact, be Student's t, or Beta, or whatever, and the tails may be considerably longer. However, many ppl may see the small STD and say that it just can't be, without realizing the statistical significance of it.

b) How errors are counted. As I understand it, an Error is counted as an out from the pitcher's POV. The presence or absence of the error is not present in our binomial calculations at all, except as another BIP that resulted in an out. I think that, to evaluate fielding, we will have to add error rate as an extra factor on top of the +-.006 std for fielders, thus giving fielders 17% responsibility for the $H, but an added responsibility for converting "oughtta_be_outs" into "actual_outs." This shouldn't affect our DIPS calculations, but _will_ affect our interpretation of UZR as a "control," so to say, since IIRC, UZR counts Errors as balls that the fielder failed to get to, which will increase UZR variance rates but will not affect our DIPS variance rates.

-Arvin

Posted 4:40 p.m., August 19, 2003 (#139) - Arvin Hsu
  We know very well how to estimate the former, and not
very well the latter. Since the BABIP figure is not
reliable for an individual pitcher, it's more accurate
to use say 50% lg, 40% team, 10% pitcher to estimate
his expected BABIP. But, that estimate will come with
a very wide margin for error.

The conclusion stands that you need to separate
things, and you can't rely on a pitcher's past BABIP
to predict the future (much like you wouldn't use his
ERA). Still outstanding is WHAT to use for BABIP.
I'll contend that PZR would be that measure. But,
that has yet to be implemented by anyone.

The answer, I hope, will be around the corner. I'll be free after the first week of Sept (Master's finished), and I'll be able to do some more in-depth work. I've been planning to run some data models of hitters, using component rates over career arcs inside a Bayesian framework, but I'll put that on hold now. What we have is the structure of a tangible data model that is experimentally testable. I should be able to enter DIPS data into a Bayesian engine that will simulate (MCMC) and establish likely data points for all pitchers, parks, and team-defenses. I'll need home/away splits to do that, but that should be all the pbp data necessary. We should come up with some very nice estimates that would be perfect for predictions.

IOW, the simulation should come up with precise distributions (mean/std) for each pitcher, each ballpark, each defense. And that would be the numbers you use to predict next year's $H.

-Arvin

Posted 5:02 p.m., August 19, 2003 (#140) - tangotiger
  Arvin,

Please note that each individual fielding position has a true standard deviation of about .016, and an observed of .024, for n=120.

The .006 figure is the TEAM standard deviation for fielding.

So, on a player basis, it's .016 x 650 (or 10). On a team basis it's .006 x 4500 (or 27).

Posted 5:04 p.m., August 19, 2003 (#141) - tangotiger
  Also note that if each team of 7 fielders were independent and randomly chosen, you would get
team ^ 2 = fielder ^2 x 7

And, 10 ^ 2 x 7 = 700
27 ^ 2 = 729

Close enough.
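Post 141's independence check, written out (a sketch; the 10 and 27 plays are the thread's own figures):

```python
import math

fielder_sd_plays = 10   # 1 stdev for one position, in plays per season
team_sd_plays = 27      # 1 stdev for the team, in plays per season

# if the 7 fielders were independent: team variance = 7 x fielder variance
print(round(math.sqrt(7) * fielder_sd_plays, 1), team_sd_plays)
```

sqrt(7) x 10 comes out to about 26.5, against the 27 observed at the team level, which is the "close enough" above.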

Posted 12:04 a.m., August 20, 2003 (#142) - Matt Goff
  Posted 10:32 a.m., August 19, 2003 (#132) - Arvin Hsu
I think it comes down to why we're adding variances:
In a normal distribution, you add when you add two random variables.
eg. X ~ Norm(3,.05^2), Y ~ Norm(4,.03^2)
Z = X+Y
Z ~ (7,.05^2+.03^2)

Although I'm not yet convinced it works this way for a normal p affecting a binomial, the sims seem to bear it out. Why? Well it makes sense. On some level, you're adding two r.v.'s, the binomial
and the normal.
...

I have not had the time to figure out everything that has been talked about so hopefully this comment is relevant:

It seems to me that your concern about the addition of variances may be alleviated by recalling that a binomial is approximately normal for large n (thanks to the good old Central Limit Theorem). I can't remember the exact rule of thumb, but it seems like n*p>25 or something like that. In any case, if I understand what I have been reading, the n should be quite adequate for the normal approximation to hold.
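Matt's point can be eyeballed directly: at n = 4500 the exact binomial pmf and the normal density are nearly indistinguishable (a sketch; log-gamma is used only to avoid float underflow on the big factorials):

```python
import math

n, p = 4500, 0.281
mu = n * p
sd = math.sqrt(n * p * (1 - p))

def binom_pmf(k):
    # exact binomial pmf computed in log space
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)

def normal_pdf(k):
    return math.exp(-((k - mu) ** 2) / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))

for k in (int(mu) - 30, int(mu), int(mu) + 30):
    print(k, round(binom_pmf(k), 5), round(normal_pdf(k), 5))
```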

Posted 8:58 a.m., August 20, 2003 (#143) - tangotiger
  I just want to interject something to keep in mind. Remember, our equation is

trueDER ^ 2 = truePitch^2 + trueField^2 + truePark^2

Erik has provided trueDER from his sim, and Arvin has confirmed it with his "observed" equation, and that is .012. I've provided the truePark figure as .004. Dropping all the decimals, and our equation becomes

128 = truePitch^2 + trueField^2

Based on UZR, which I'll have to go over because I'm not sure I'm using the right numerator (Levitt's numbers might include HR), the observed single position UZR is around .025 and the observed team fielding UZR is around .010. So, our true UZR will be somewhere between .005 and .008, probably.

We're not even sure that UZR is the best thing to use, but it is the best thing available at the moment. (You could even use ZR, and I'm pretty sure you'll get a single position observed stdev of .025 for your regular players. This is easy to eyeball since the range of players is mostly around +/-.05 outs/BIP, so that would be 2 standard deviations.)

Anyway, so we've got something like
128 = truePitching^2 + [5 to 8]^2

So, when fielding = 8, pitching = 8.
When fielding = 5, pitching = 10

etc, etc.

So, depending on how the fielding measure is determined and manipulated, a small change there will have a huge impact in the relative value between fielding and pitching.
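The sensitivity described in this last post, as a sketch (units are thousandths of hits/BIP, following the "dropping all the decimals" convention above):

```python
import math

# 128 = truePitch^2 + trueField^2, in thousandths of hits/BIP
for field in (5, 6, 7, 8):
    pitch = math.sqrt(128 - field ** 2)
    print(field, round(pitch, 1))
```

A one-unit change in the fielding std moves the implied pitching std by roughly the same amount in the opposite direction, which is why nailing down the fielding measure matters so much here.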