Tango on Baseball Archives

Request for statistical assistance (December 17, 2003)

I would like to make use of the statistical-savvy Primates, if I may. (Click on discussion, and page up.)

I computed the "catcher deltas" for WP, PB, SB, BK, Pickoffs, etc. This probably could have been done using logistic regression (as probably any "matchup" can be done), but I did it the only way I knew. For every pitcher, I figured out the rate for the above categories with my catcher in question and without (using career totals). So, I get my catcher deltas.
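One reading of the with/without calculation above can be sketched as follows. It assumes each pitcher's career is reduced to four totals (events and PA with the catcher, events and PA without him). The function name `catcher_delta` and the sample numbers are hypothetical and are not taken from the study.

```python
def catcher_delta(pitcher_records):
    """Sum, over pitchers, of events with the catcher minus the events
    expected from the same pitcher's rate without him."""
    total = 0.0
    for ev_with, pa_with, ev_without, pa_without in pitcher_records:
        if pa_with == 0 or pa_without == 0:
            continue  # no basis for a with/without comparison
        rate_without = ev_without / pa_without
        total += ev_with - rate_without * pa_with
    return total


# Hypothetical passed-ball totals for two pitchers:
# (events with, PA with, events without, PA without)
records = [
    (8, 8000, 4, 2000),
    (3, 1500, 6, 3000),
]
print(catcher_delta(records))
```

A negative total means fewer events with the catcher than the pitchers' other batterymates would suggest.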

Now, if I had included some benign category, like SS errors or something (think of a good one), we would expect that the catcher deltas for this category would be completely due to luck. For their observed deltas, we will get some catcher with a +1, and another with a -2, etc.

The standard deviation of the "SS errors" deltas should come out to exactly that of the standard deviation based on luck. Now, if the two standard deviations are exactly the same, that tells me that I can regress 100% of the deltas to infer the true rate from this observed rate.

Now, my question. What if the standard deviation of the "balks" deltas is twice as large as the luck standard deviation? This tells me that the catcher DOES have some influence on controlling balks (for whatever reason, whether the pitcher is more comfortable with a strong catcher, or a runner will stay closer to the bag so that the pitcher will not try to pick him off, etc, etc). Not important.

Can I use the distribution of the deltas compared to the distribution that would have been generated by random to figure out a regression towards the mean figure? Can I make use of the standard deviations of the deltas to infer the true rate from the observed rate?

I appreciate any assistance those statistical-savvy can give me.
--posted by TangoTiger at 05:41 PM EDT

Posted 5:52 p.m., December 17, 2003 (#1) - Arvin Hsu
short answer: yes, you should.

long answer:
depends on your assumptions, specifically what the statistical model is behind these "catcher deltas." Tango, what exactly are the "catcher deltas", and how did you calculate them? I don't remember the previous catcher breakdown thread too clearly.

Posted 6:14 p.m., December 17, 2003 (#2) - Arvin Hsu(e-mail)
Feel free to email me if you want.

Posted 8:26 p.m., December 17, 2003 (#3) - tangotiger (homepage)
Check the above homepage link.

***********

This is what I'm thinking as an amateur:

If the stdev of random = 1, and the stdev of the catchers' deltas = 2, then:

2 ^ 2 = true ^ 2 + 1 ^ 2, making true = 1.73

So, to turn the observed standard deviation from 2 to 1.73, I regress all the observed values by .27/2 = 13.5%. Am I doing this right?

Posted 8:37 p.m., December 17, 2003 (#4) - Arvin Hsu
That uses the same formula we used for figuring out defensive contributions. _That_ formula was an approximation, but it worked.
This one, I don't know. Let me read up on your catcher's delta tonight.
I may have a statistical model tomorrow.

-Arvin

Posted 8:51 p.m., December 17, 2003 (#5) - Alan Jordan
If I understand this correctly, you can still do a logistic regression with the data you have. You can also use linear regression to give you an approximate answer (that will probably be pretty close).

Let each y be the proportion/rate
Let independent variable 1 be the pitcher.
Let independent variable 2 be the catcher.
Let the weight be the denominator of the rate stat.

If you use the linear regression, you can show how much of the rsquare is caused by pitchers and how much by catchers.

As long as you don't specify interaction terms in your model, you get estimates that have some degree of regression to the mean in them (in large samples anyway).

A logistic regression could be run the same way, i.e. without exact matchups.

Your method has two problems.

1. You have no proof that it works (if it even does).
2. Nobody would understand what you're doing.

Posted 12:00 a.m., December 18, 2003 (#6) - tangotiger
Alan,

I wasn't interested (for the moment) in trying to redo the catcher study. I was just using the results of that to try to establish if I can use the standard deviations of the observed results to determine the regression towards the mean.

As for this comment:
Your method has two problems.

1. You have no proof that it works (if it even does).
2. Nobody would understand what you're doing.

Are you referring to the method in my article? For #2, it's rather straightforward, and many people understand what I'm doing. As for #1, I'm satisfied with my processes in everything I do. I'm not looking to prove it beyond however I present it.

Posted 1:25 p.m., December 18, 2003 (#7) - Alan Jordan
Tango,

No, I'm not referring to the method in your catcher's article. As far as I can tell it's unbiased. The method I'm talking about is attempting to use SS errors to regress balks.

You're making the assumption that balks has the same variance of error as SS errors. The variance of error for these events should be proportional to the rate of events/PA. If they have different rates then forget it. Even if they have the same rates, I think you still have to make the argument that they are equivalent. Also there are other sources of error such as age of pitcher/catcher that might add to the variance of error. We simply don't know what part of the total variance for balks, etc... is error variance or true variance. The same is true for SS errors.

In short, while it is plausible that your method for regression to the mean might work, despite the objections that I've made, I wouldn't attempt to use it until it can be shown to work.

Posted 2:16 p.m., December 18, 2003 (#8) - tangotiger
Thanks for the clarification Alan.

Can you expand on error variance and true variance, with some examples?

Posted 2:22 p.m., December 18, 2003 (#9) - tangotiger
The method I'm talking about is attempting to use SS errors to regress balks.

Actually, I didn't explain it properly. I was making the distinction between 2 things. For example, if I used SS errors, and I get a standard deviation of the observed value of .026, and if the expected standard deviation of SS errors was also .026, then I can say that the SS errors deltas are completely random, from the perspective of the catcher.

For Pitcher Balks, the observed standard deviation was 2 per 162 GP, while I expected, from a purely random distribution, to be 1 per 162 GP. Therefore, I can't say that the catcher doesn't influence the pitcher's balks totals. He DOES. But, I don't know to what degree he does.

This is why I'm asking how to make use of the standard deviation of the observed deltas against the standard deviation of random expected deltas.

Apologies for not making it clearer. Even more apologies if I'm still not clear!

Posted 7:49 p.m., December 18, 2003 (#10) - Alan Jordan
First, regression to the mean using a year to year r works when r represents the ratio of true variance to total variance. In theory if you know the true variance or the error variance you could construct a ratio because total variance can be measured from the data directly.

For example if the total variance for a variable was 10 and we knew that the error variance was 8, then we would get a ratio of 1/5.

Total var=err var + true var
true var=total var - err var
true var=10-8
true var =2

ratio=true var / total var
ratio=2/10
ratio=1/5

We could then use the ratio 1/5 to multiply the differences of each score - mean

if the mean were 50 and score1 was 100 then, using the 1/5 ratio you would get
adj score=mean+(score-mean)*1/5
adj score=50 + (100-50)*1/5
adj score=60
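The arithmetic above can be sketched as a small function. This is only an illustration of the ratio method as described; the function name and the clamp to zero when the error variance exceeds the total variance are assumptions, not part of the post.

```python
def regress_to_mean(score, mean, total_var, error_var):
    """Shrink a score toward the mean by the ratio true var / total var."""
    if total_var <= 0:
        raise ValueError("total variance must be positive")
    # Assumption: if the error variance exceeds the total, treat true var as 0.
    true_var = max(0.0, total_var - error_var)
    ratio = true_var / total_var
    return mean + (score - mean) * ratio


# Numbers from the worked example: total var 10, error var 8, mean 50, score 100.
print(regress_to_mean(100, 50, total_var=10, error_var=8))
```

The call reproduces the adjusted score of 60 from the example, up to floating-point rounding.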

I don't really know of a situation where we would know the error variance. For batting average, we would know the binomial part for batting average and we might be able to figure out the year to year variance.

I thought that was what you were trying to do. If not, never mind.

"For Pitcher Balks, the observed standard deviation was 2 per 162 GP, while I expected, from a purely random distribution, to be 1 per 162 GP."

Is this right? Do you mean standard deviation or rate? Stan Dev isn't usually expressed in terms of successes per trial. Also observed stan dev is usually larger than an adjusted std dev or var.
I don't get what you mean by purely random distribution, is it binomial, normal or what?

In general, I would recommend that you go back to post #5. It will tell the average effect that catchers have on balks and other events per PA. It will also give a hypothesis test and you can even figure out how much the model's rsquare is solely attributable to all the catchers.

Actually, I'm not 100% sure what you're doing, but then I'm sick today.

Posted 8:28 p.m., December 18, 2003 (#11) - tangotiger
Is this right? Do you mean standard deviation or rate?

I could, and should, have marked the rates and standard deviation of the rates as a per play, but I put it at per 162 GP. It makes it a little more confusing, but I find it easier not to deal with decimals if I don't have to.

I think there's something that's being missed between you and me. How about we try a simple example.

I flip 1000 coins with my kid in the room, and I get 520 heads. I flip 1000 coins with my kid sleeping, and I get 490 heads. So, I give my kid a +30.

My 10,000 other friends do the same thing. Some get +50, others -20, others +10, others -5, etc, etc.

So, question 1: what is the observed standard deviation of the deltas of these 10,000 flippers? ( I guess it would help if I give you the data.)

question 2: what is the expected standard deviation of the deltas, assuming that only luck is expected.

If the standard deviations of the deltas of the two questions above are the same, then I would say that the kids have no effect on the flipping, and so, I can regress the deltas at 100%.

If on the other hand the SD of the deltas of the first question is twice the SD of the deltas of the second question, then THERE is an effect. So, my question #3 is how to get a regression equation for that? Is it as simple as post #3?

Thanks again...
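For question 2, a minimal sketch of the luck-only standard deviation, assuming a fair coin and a delta defined as heads with the kid in the room minus heads without. The function name is hypothetical.

```python
import math


def luck_sd_of_delta(n_with, n_without, p=0.5):
    """SD of (heads with) - (heads without) when both counts are pure
    binomial luck and the two sets of flips are independent."""
    var_with = n_with * p * (1 - p)
    var_without = n_without * p * (1 - p)
    return math.sqrt(var_with + var_without)


sd = luck_sd_of_delta(1000, 1000)
print(round(sd, 1))
```

Under these assumptions the +30 in the example is about 1.3 luck standard deviations.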

Posted 8:47 a.m., December 19, 2003 (#12) - tangotiger
I have 100 people, who each flip 1000 coins. The expected win/loss is 50-50. The standard deviation of the "true talent" of these flippers is known to be zero (i.e., pure luck).

The experiment yields a standard deviation of the deltas for this sample at 32.

I then invite everyone's kids, and they repeat the process. The expected win/loss is still 50-50. The standard deviation of the "true talent" of these flippers with kids is known to be .028.

The experiment yields a standard deviation of the deltas for this sample at 64. I ask them to flip again, and again I get a standard deviation of 64.

I then run a "sample1-to-sample2" correlation. The r was .73, meaning that I want to regress 27%.

Can I get that 27% in other ways?

observed stdev ^ 2 = true stdev ^ 2 + err stdev ^ 2
64 ^ 2 = true ^ 2 + 32 ^ 2

(32 ^ 2) / (64 ^ 2) = 25%

So, I think it's a rather simple step to establish what the regression towards the mean figure, given only the standard deviation of the observed, and the stdev of the random.

Am I on to something here?
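The experiment can be checked with a quick simulation. This sketch assumes true talents are normally distributed around .5 with SD .028, that the delta is heads minus tails (the p-q definition given in post #14), and it uses 1,000 flippers rather than 100 to cut down the noise. The seed and all names are arbitrary.

```python
import math
import random
import statistics


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def heads_minus_tails(rng, p, flips):
    """One person's delta: heads minus tails over the given number of flips."""
    heads = sum(rng.random() < p for _ in range(flips))
    return 2 * heads - flips


rng = random.Random(2003)  # arbitrary seed
FLIPS, PEOPLE, TALENT_SD = 1000, 1000, 0.028

talents = [rng.gauss(0.5, TALENT_SD) for _ in range(PEOPLE)]
sample1 = [heads_minus_tails(rng, p, FLIPS) for p in talents]
sample2 = [heads_minus_tails(rng, p, FLIPS) for p in talents]

luck_var = 4 * FLIPS * 0.5 * 0.5  # Var(heads - tails) for a fair coin
observed_var = statistics.pvariance(sample1)
r = pearson_r(sample1, sample2)

print("observed SD of deltas:", round(math.sqrt(observed_var), 1))
print("1 - r:", round(1 - r, 3))
print("luck var / observed var:", round(luck_var / observed_var, 3))
```

If the last two figures agree to within sampling noise, that supports using the variance ratio in place of a sample-to-sample correlation, at least in this idealized setting.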

Posted 5:25 p.m., December 19, 2003 (#13) - Alan Jordan
There's a couple of things here so let me do them in order.

1. "I could, and should, have marked the rates and standard deviation of the rates as a per play, but I put it at per 162 GP."

std dev and var are customarily expressed simply as numbers not in units of per play, per AB, dollars or inches. That's what I meant. Stan dev can be expressed in the same units as rates, but they usually aren't. Variance would have to be expressed in those units squared. It confused me that's all.

2. "So, question 1: what is the observed standard deviation of the deltas of these 10,000 flippers? ( I guess it would help if I give you the data.)"

That's easily measured by using the standard deviation formula on all of your deltas, and each delta appears to be the difference between your score and someone else's.

3. "question 2: what is the expected standard deviation of the deltas, assuming that only luck is expected."

Assuming everyone is using a fair coin then the expected stan dev is
sqrt(p*(1-p)/N)*1000 = sqrt(.5*.5/1000)*1000 = 15.811388.

4. "So, my question #3 is how to get a regression equation for that?"

If the only source of error were binomial then you could calculate true var/total var as

r=(total var-err var)/total var

where err var=(p*(1-p)/N)*N^2, i.e. N*p*(1-p), the square of the standard deviation in point 3

and
adj score=mean+(score-mean)*r

In your terminology, you would say that we are regressing 1-r or (err var/total var).

5. The example in post #12 is confusing. What are the deltas? Are they the # of heads-500? Are they the # of heads-your # of heads?

If the standard dev of those two groups are known to be 0 and .028 respectively, then you would regress them 100% and 99.999% respectively.

I don't see how you get std devs of 32 and 64. I guess you didn't really run this coin flipping scenario.

If binomial error is the only source of error then I can see how total std dev of 64 and error std dev of 32 will give you .75 which is close to your r of .73.

The problem with using the formula that I listed out in 4 is that it only deals with binomial error and there are other sources of error in baseball such as learning, health, adjustments by opposition, etc... that can't even be modeled like parks, age and opposition can. Some variables like plate umpire should affect balks, strikes, walks, but we don't even bother to factor them in even though they add some amount of error variance. The formula I listed in #4 underestimates the amount you need to regress.

I don't know if this clears anything up, but assuming that the binomial error is the only source of error and that it is uncorrelated to the spread of true talent, then yes there is a formula for regression to the mean. Use it at your own risk in baseball, but it should work in coin flipping experiments.
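The two formulas in points 3 and 4 can be sketched as below, assuming binomial error is the only source of error (the same caveat as above). The function names are hypothetical.

```python
import math


def binomial_error_sd(n, p=0.5):
    """SD of the number of successes in n trials: sqrt(p*(1-p)/n) * n."""
    return math.sqrt(p * (1 - p) / n) * n


def reliability(total_sd, error_sd):
    """r = (total var - err var) / total var; the amount regressed is 1 - r."""
    total_var, err_var = total_sd ** 2, error_sd ** 2
    return (total_var - err_var) / total_var


print(round(binomial_error_sd(1000), 2))  # heads out of 1000 fair flips
print(reliability(64, 32))                # SDs quoted in post #12
```

With the standard deviations of 64 and 32 from post #12, `reliability` gives the .75 mentioned in point 5.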

Posted 7:18 p.m., December 19, 2003 (#14) - tangotiger
My deltas are p-q, as opposed to what you are showing as p - n/2, or (p-q)/2. I think we're on the same page now.

Thanks for the responses Alan.

Posted 2:45 a.m., December 20, 2003 (#15) - AED
Tango - The short answer is that, for your purposes, multiplying the difference from league average by the variance ratio will give a reasonable regression to the mean, so long as all players caught similar numbers of innings. If there are wide differences in number of innings you have to do a little more work.

For the values in post #9, the observed variance is 4 (2^2) and the theoretical random variance is 1 (1^2), so you would regress by 1/4 towards the mean.

Posted 8:53 a.m., December 20, 2003 (#16) - tangotiger
AED, can you tell me how to do that extra work?

Right now, I limited it to the 29 catchers who caught the most, and then took the average of them to get about 45,000 PA, or about 20,000 PA with at least 1 runner on base. That's a good shorthand, but I'd like to capture at least 100 or more of the catchers.

Posted 4:36 p.m., December 20, 2003 (#17) - Alan Jordan
Tango,

going back to post #3, where do you get that the standard deviation of random is 1? How do you know it's 1, or is this an assumption?
I guess this also applies to pitchers' balks in post #9. I would feel a little more comfortable if I understood where you got that part.

Also my delta was just p-.5. Since p-q is about double p-.5 when p is near .5, that explains the difference in std dev of error between yours and mine.

Posted 11:18 p.m., December 20, 2003 (#18) - AED
Here is the easiest way I can think of to do this empirically.

Divide the catchers into groups based on number of plate appearances. For each group, calculate the variance in each statistical category, as well as the random variance. Regress each catcher's value towards the group average (not the overall average) using:
value = player + (average - player) * (random variance) / (actual variance)
This saves a lot of work, and allows you to regress each player to the average of similar catchers.
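A minimal sketch of that group-wise regression, assuming the random variance for the group has already been worked out. The use of the population variance, the cap at 100% regression, and the sample rates are assumptions made for illustration.

```python
import statistics


def regress_group(rates, random_var):
    """Regress each catcher's rate toward the average of his own group."""
    average = statistics.fmean(rates)
    actual_var = statistics.pvariance(rates)  # population variance (assumption)
    if actual_var > 0:
        # Cap at 1 so that luck can explain at most all of the spread.
        fraction = min(1.0, random_var / actual_var)
    else:
        fraction = 1.0
    return [player + (average - player) * fraction for player in rates]


# Hypothetical per-PA rates for one group of catchers with similar playing time.
group = [0.010, 0.014, 0.018, 0.022]
print(regress_group(group, random_var=0.000005))
```

Each group would get its own call, so a part-time catcher is pulled toward other part-timers rather than toward the overall average.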

If you want to do a little more work, you could determine the average and variance ratio as a function of typical innings caught per group. These should be reasonably smooth functions, and you could regress each catcher to his own mean using his own regression amount. The variance ratio (assuming you are working in rates rather than totals) should go something like 1/(1+a*n), where a is a fixed constant and n is the number of innings caught or plate appearances.
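One way to see where the 1/(1+a*n) shape comes from: if the random variance of a rate over n plate appearances is c/n and the variance of true ability is some fixed value, then random/(random + true) simplifies to 1/(1 + n*true/c), so a = true/c. The sketch below assumes exactly that; the 1% event rate and the true-talent SD of 0.002 are hypothetical inputs.

```python
def regression_fraction(n, a):
    """Share of the gap to the group mean to remove, for n plate appearances."""
    return 1.0 / (1.0 + a * n)


# Hypothetical inputs: a 1% event rate and a true-talent SD of 0.002 per PA.
c = 0.01 * 0.99        # binomial variance per PA
true_var = 0.002 ** 2  # variance of true ability
a = true_var / c

for n in (1000, 5000, 20000, 45000):
    print(n, round(regression_fraction(n, a), 3))
```

The fraction falls smoothly as playing time grows, which is the behavior described above.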

All of this assumes that you can calculate the random errors, of course. I assume you are, but want to make sure you take into account the error contributions from the fact that "pitcher with other catcher" has random and systematic offsets from the true desired baseline of "pitcher with average catcher".

Posted 3:04 a.m., December 21, 2003 (#19) - tangotiger
Post #3 was illustration only.

Perhaps the problem is that I'm using the wrong terms. I think I should have said that the variance was "1 standard deviation = 1 bk / 162 GP".

Posted 12:08 a.m., December 22, 2003 (#20) - Arvin
Alright... here are some preliminary thoughts.
Your catcher's delta is a difficult random variable to deal with.
Why? You're determining it by a fairly complex method.
Ostensibly, it's similar to the sum of two binomials.
The problem is, the binomials may be vastly different.
eg. Carter:
PB distributed as Binomial(n=8000,p=.001) -making up numbers here.
PB2 distributed as Binomial(n=2000,p=.002)
X0 = PB-PB2*8000/2000
X = X0 normalized to 162 GP.
Thus, X is a mixture of two binomial random variables with different
N and different p.

Posted 2:27 a.m., December 22, 2003 (#21) - AED
If you're having difficulty with the variances, I'd suggest approaching it from the opposite direction:
value = average + (player - average) * (intrinsic variance) / (actual variance)
The "actual variance" is the square of the observed standard deviation of the group of players, and the "intrinsic variance" is the variance due to player abilities. I would also suggest keeping this purely in rates (per PA) until the very end, at which time you can multiply by some number of PA per 162 games.

Calculating "intrinsic variance", which I'll abbreviate "ivar" is not too tough. For one pitcher, the variance among rates per catcher equals:
variance = <c/npa> + ivar,
which can be rewritten:
ivar = variance - <c/npa>
"c" is the random variance per PA and equals the pitcher's career rate times (1-pitcher's career rate). In other words, if 1% of plate appearances have wild pitches, c equals 0.01 * 0.99. The average is taken over all catchers to have caught the pitcher for some minimum number of plate appearances (you choose the cutoff -- too low makes for more random noise; too high can cause biases to creep in), and "npa" is the number of plate appearances each catcher had with the pitcher. You calculate the variance directly from the rates those same catchers had while working with the pitcher.

Running through this process for all pitchers to have worked with at least, say, 5 or 10 catchers for significant amounts of time, the equation above becomes:
ivar = < variance - <c/npa> >
The outer average is taken over the pitchers; the inner average is taken over each pitcher's catchers. So run through this process to calculate a single value of ivar for the rate in question.
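The calculation can be sketched as below. Several choices here are assumptions rather than part of the description: the data layout, the use of the sample variance, the default cutoffs, and computing the career rate from the qualifying catchers only. The numbers are hypothetical.

```python
import statistics


def intrinsic_variance(pitchers, min_pa=1, min_catchers=2):
    """pitchers maps a pitcher id to a list of (npa, events), one per catcher.
    Returns the average over pitchers of: variance - mean(c / npa)."""
    per_pitcher = []
    for splits in pitchers.values():
        splits = [(npa, ev) for npa, ev in splits if npa >= min_pa]
        if len(splits) < min_catchers:
            continue  # too few catchers to measure a spread
        total_pa = sum(npa for npa, _ in splits)
        career_rate = sum(ev for _, ev in splits) / total_pa
        c = career_rate * (1 - career_rate)      # random variance per PA
        rates = [ev / npa for npa, ev in splits]
        variance = statistics.variance(rates)    # sample variance (assumption)
        mean_c_over_npa = statistics.fmean([c / npa for npa, _ in splits])
        per_pitcher.append(variance - mean_c_over_npa)
    return statistics.fmean(per_pitcher)


# Hypothetical wild-pitch splits: (PA with catcher, events with catcher).
data = {
    "pitcher_a": [(1000, 10), (1000, 30)],
    "pitcher_b": [(2000, 18), (1500, 15), (500, 7)],
}
print(intrinsic_variance(data))
```

A result at or below zero would suggest that the spread among catchers is no larger than luck alone would produce.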

As noted in my earlier post, you should group catchers by career number of plate appearances. The "average" and "actual variance" of the rates for each group are determined separately, and combined with "ivar" from above will give the group's regression to average.

There is a MUCH more difficult way you could approach this, modeling the variances of the catcher ratings directly, but I doubt you would gain much over this technique (aside from a splitting headache).

Posted 8:12 a.m., December 22, 2003 (#22) - tangotiger
Thanks for the input guys. I think I'll need a couple of hours to digest this, and then play around with it. Hopefully, I'll have something to report before xmas.

Posted 1:12 p.m., December 22, 2003 (#23) - Arvin
Ok, further thoughts...

1) The binomial is a funny distribution. It's very similar to the normal distribution for .3<p<.7, but as you near the edges, it becomes increasingly skewed towards .5 [...] EXTREMELY skewed. What to do about it? Well, you can conclude that the normal approximation will do nothing for you. You can't use it.

2) back to the mixture of binomial R.V.'s:
PB-Carter = PBC ~ Bin(8104,.0009) (µ=7.3, σ² = 7.2)
PB-others = PBO ~ Bin(3598,.0022) (µ=7.8, σ² = 8.0)
δPB = PBC - (8104/3598)*PBO

Formula: Var(aX) = a^2*Var(X)
Thus,
Var((8104/3598)*PBO) = (8104/3598)^2*Var(PBO) = (8104/3598)^2*3598*(.0022)*(1-.0022)
Thus, the second term in the mixture, (8104/3598)*PBO,
has (µ=17.8, σ² = 40.8)

δPB = PBC - (8104/3598)*PBO
= (µ=7.3, σ² = 7.2) - (µ=17.8, σ² = 40.8)

Alright, what then? Simulation results(n=10,000) give:

δPB = (µ=-10.5, σ² = 48.4)

The variances don't strictly add, as you would expect with a normal distribution. Here's a histogram chart of the resultant R.V:
center-of-bloc count
-40.5 4
-35.5 16
-30.5 83
-25.5 345
-20.5 960
-15.5 2146
-10.5 2690
- 5.5 2423
- 0.5 1085
+ 4.5 230
+ 9.5 18

Things I notice:
a) the variance from the small n sample dominates.
b) the variances come close to pure additive variance. You could probably fudge the variance calculation by approximating an additive model and then fudging upwards a little bit.
c) the distribution is skewed but not too crazily skewed.
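A rough way to re-run the comparison, assuming the two counts are independent binomials with the n and p quoted above. For independent counts, Var(X - a*Y) = Var(X) + a^2*Var(Y) holds whatever the shape of the distributions, so a small gap between the simulated and additive figures is consistent with sampling noise and rounding. The sampler, seed, and names are assumptions made for this sketch.

```python
import math
import random
import statistics


def binomial_draw(rng, n, p):
    """Draw from Binomial(n, p) by jumping from one success to the next
    (geometric gaps), which is quick when p is small."""
    log_q = math.log(1.0 - p)
    successes = 0
    position = 0
    while True:
        # Trials up to and including the next success are geometric.
        position += int(math.log(1.0 - rng.random()) / log_q) + 1
        if position > n:
            return successes
        successes += 1


N_C, P_C = 8104, 0.0009  # with Carter (figures from the post)
N_O, P_O = 3598, 0.0022  # with other catchers
scale = N_C / N_O

mean_exact = N_C * P_C - scale * N_O * P_O
var_exact = N_C * P_C * (1 - P_C) + scale ** 2 * N_O * P_O * (1 - P_O)

rng = random.Random(8104)  # arbitrary seed
draws = [binomial_draw(rng, N_C, P_C) - scale * binomial_draw(rng, N_O, P_O)
         for _ in range(10_000)]

print("additive: ", round(mean_exact, 2), round(var_exact, 2))
print("simulated:", round(statistics.fmean(draws), 2),
      round(statistics.pvariance(draws), 2))
```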

Next:
You're using ΔPB, which is δPB normalized to 162 Games Played.
Q) How do you do this normalization?
So far, we have δPB = -11 over 8104 PA. How is this normalized to Games Played?

Posted 1:14 p.m., December 22, 2003 (#24) - Arvin
First line edited:
1) The binomial is a funny distribution. It's very similar to the normal distribution for .3<p<.7, but as you near the edges, it becomes increasingly skewed towards .5.

Posted 2:42 p.m., December 22, 2003 (#25) - tangotiger
I use something like 5500 PAs per year (about 140 games).

You'll notice that Gary Carter's line reads a delta of -76 PB with 72,385 PAs. In the next table, we have Carter at 13 effective seasons (72,385/13=5568). His per season delta PB is -6. So, either -76/13, or -76/72385*5500.

Posted 3:46 p.m., December 22, 2003 (#26) - Arvin Hsu
So to continue,

δPB = (µ=-10.5, σ² = 48.4)

ΔPB = δPB *5500/8104 (in this case)
ΔPB = (µ=-10.5*5500/8104, σ² = 48.4*(5500/8104)²)

Posted 3:47 p.m., December 22, 2003 (#27) - Arvin Hsu
ΔPB = (µ=-7.1, σ² = 22.3)
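The rescaling step is just the rule that multiplying a variable by k multiplies its mean by k and its variance by k squared. A minimal sketch using the figures above (the function name is hypothetical):

```python
def rescale(mean, variance, k):
    """Mean and variance of k*X, given the mean and variance of X."""
    return mean * k, variance * k ** 2


k = 5500 / 8104  # PA per season divided by PA in the sample
mean_season, var_season = rescale(-10.5, 48.4, k)
print(round(mean_season, 1), round(var_season, 1))
```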

Posted 12:37 a.m., December 23, 2003 (#28) - Alan Jordan
Simulated Data

Pitcher Catcher PA 2B

1 1 2049 182
1 2 5516 64
1 3 9770 220
1 4 4327 18
1 5 6025 728
1 6 3172 27
1 7 3187 90
2 1 9900 129
2 2 697 2
2 3 3171 7
2 4 5785 5
2 5 3970 77
2 6 4110 4
2 7 9546 37
3 1 4198 97
3 2 865 2
3 3 2926 12
3 4 3072 2
3 5 7483 217
3 6 4924 6
3 7 9204 46
4 1 2793 18
4 2 8393 6
4 3 7099 5
4 4 4957 1
4 5 5266 46
4 6 6995 5
4 7 5887 6
5 1 9416 1192
5 2 5438 110
5 3 9967 359
5 4 8649 69
5 5 9512 1742
5 6 1880 23
5 7 695 29
6 1 9892 72
6 2 8733 16
6 3 2465 3
6 4 6545 1
6 5 9491 111
6 6 1865 2
6 7 1555 5
7 1 5629 130
7 2 871 2
7 3 7082 42
7 4 6103 12
7 5 6096 258
7 6 8911 15
7 7 5313 32

Use a logistic regression to model the probability of event given the pitcher and catcher. Use dummy variables for the first 6 pitchers and catchers. If all catcher dummy variables are 0, then the catcher is #7. Ditto for pitchers.

                            Standard        Wald
Parameter    DF  Estimate      Error  Chi-Square  Pr > ChiSq

Intercept     1   -4.9170     0.0771   4068.2959  [...]

[... remaining parameter rows lost in posting ...]

Effect       DF  Chi-Square  Pr > ChiSq

catcher       6      4129.6      <.0001
pitcher       6      4887.4      <.0001

As long as you include an intercept in the model and don't specify a coefficient for each combination of pitcher and catcher, then the estimates are regressed to the grand mean.
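Since the output above did not survive posting, here is a rough, self-contained sketch of the same kind of model: a main-effects logistic regression with dummy variables, fit by Newton-Raphson. It uses only the first three pitchers and catchers from the simulated data to stay short, with the last level of each as the baseline. It is an illustration written for this thread, not the software or the output used in the post, and it will not reproduce the estimates shown.

```python
import math

# Subset of the simulated data above: (pitcher, catcher, PA, events).
ROWS = [
    (1, 1, 2049, 182), (1, 2, 5516, 64), (1, 3, 9770, 220),
    (2, 1, 9900, 129), (2, 2, 697, 2), (2, 3, 3171, 7),
    (3, 1, 4198, 97), (3, 2, 865, 2), (3, 3, 2926, 12),
]
N_PITCHERS = N_CATCHERS = 3  # the last level of each is the baseline


def design_row(pitcher, catcher):
    """Intercept plus dummies; pitcher 3 and catcher 3 have all dummies at 0."""
    x = [0.0] * (1 + (N_PITCHERS - 1) + (N_CATCHERS - 1))
    x[0] = 1.0
    if pitcher < N_PITCHERS:
        x[pitcher] = 1.0
    if catcher < N_CATCHERS:
        x[N_PITCHERS - 1 + catcher] = 1.0
    return x


def fitted_probability(beta, pitcher, catcher):
    """Model probability of the event for one pitcher/catcher pairing."""
    eta = sum(b * v for b, v in zip(beta, design_row(pitcher, catcher)))
    return 1 / (1 + math.exp(-eta))


def solve(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit(rows, iterations=50):
    """Newton-Raphson maximum likelihood for grouped binomial data."""
    k = len(design_row(1, 1))
    overall = sum(r[3] for r in rows) / sum(r[2] for r in rows)
    beta = [math.log(overall / (1 - overall))] + [0.0] * (k - 1)
    for _ in range(iterations):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for pitcher, catcher, n, y in rows:
            x = design_row(pitcher, catcher)
            prob = fitted_probability(beta, pitcher, catcher)
            weight = n * prob * (1 - prob)
            for i in range(k):
                grad[i] += x[i] * (y - n * prob)
                for j in range(k):
                    hess[i][j] += weight * x[i] * x[j]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < 1e-10:
            break
    return beta


beta = fit(ROWS)
print([round(b, 3) for b in beta])
```

The coefficients are on the log-odds scale: the first is the intercept (pitcher 3 with catcher 3), followed by the two pitcher dummies and the two catcher dummies.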

Posted 12:50 a.m., December 23, 2003 (#29) - Alan Jordan
O.K., that didn't post correctly. I can email the details if anybody wants them. The point is that people have already figured out a way to deal with this problem. It looks like you're trying to reinvent the wheel.