Solving DIPS (August 20, 2003)
Note: You will find it easier to right-click the above link, and "save target as" to your hard drive, and open it directly from there.
This is a PDF document that is a summary of the recent thread at this site. It is 22 pages long (only 40 KB) and contains no additional information. Its purpose is as a reference document for those who have not followed this thread in its entirety. I'm hoping that I've captured most of the important data, arguments and conclusions from that recent thread.
I may produce a smaller summary of this document which will cut through all the math as much as possible, while still presenting the important data and conclusions.
Most of the credit for this document should go to Erik and Arvin for doing the grunt work and providing the critical insights. When citing this document, you should cite "Erik Allen and Arvin Hsu", or "Erik Allen, Arvin Hsu, Tangotiger, et al".
--posted by TangoTiger at 09:34 AM EDT
Posted 11:53 a.m.,
August 20, 2003
(#1) -
Erik Allen(e-mail)
Really nice presentation Tango. I would say that you hit all the high points of the discussion.
I have held off on doing more simulations to this point, since it appears you can also do this work analytically, but I can run some situations if you feel there is a need.
I have not yet said thanks to Arvin as well, for all his work. Thanks, Arvin!
Posted 12:09 p.m.,
August 20, 2003
(#2) -
tangotiger
I'm actually exhausted Erik! I've been putting off doing some other baseball stuff for 2 weeks, and I'm really happy with the way this thread unfolded.
Feel free to run more sims, but I don't think they're necessary at this point. They might be valuable if you do more breakdowns, like with GB/FB and Lefty/Righty and by the 7 fielding positions, etc. I think Arvin's equation works on independent variables, but I don't think that would apply here.
But, I think your work and Arvin's should shake things up!
Posted 3:55 p.m.,
August 20, 2003
(#3) -
Jim R
Just a comment on the write up and not the actual information. FWIW, I particularly enjoyed the write up and I thought it was fairly tight. Maybe I didn't save any time from reading the write up as opposed to reading the thread, but I at least had the illusion that I saved time :)
I know you probably have better things to do, and I guess BPrime isn't springing for you to have a clerk, but I do find these items useful.
I think you will find that besides myself, there are people that fit this profile:
(1) Are interested in your work and the work of your contemporaries
(2) Have variable amounts of time we can devote to reading the results (I've been on partial vacation for a week).
(3) Probably have little to offer in the more serious development of the ideas.
Your synopses provide great utility to me (or us). I hope that, if not you, BPrime can find a way to keep them coming from time to time.
Posted 4:04 p.m.,
August 20, 2003
(#4) -
tangotiger
It's an interesting thought, and I'll pass it along to the group.
One thing I did a few months ago was to clean up my website by ordering things, so that it's more useful. Of course, since then, I've added a few more articles, but I have not updated the index to point to them. Story of my life. My wife has been after me to update the pictures of my baby on our personal site. I'm 7 months behind on that too.
Many many times I've thought about doing a "best of" kind of deal, and putting things in one place. But like you are alluding to: time/money/work/family is a tough thing to balance.
I agree though that it's nice to have everything in one place, and I think that within 1 year, maybe less, I'll have consolidated everything I've done into something organized, if not in PDF/book format, at least in a "finalized" fashion.
Thanks for the idea!
Posted 4:29 p.m.,
August 20, 2003
(#5) -
Andrew Edwards
I belong to the class of person Jim R. described, and I too find these unbelievably useful.
Is there a way someone like us could help?
Posted 7:59 p.m.,
August 20, 2003
(#6) -
FJM
I took a quick look at the Team ZR's for SS and 3B, 2001-03. I wanted to see how close they came to your estimate for observed standard deviation (.025). The answer is, pretty close, but there is an important caveat attached.
At SS I got a 3 year average of .843 with a st. dev. of .027, very close to your number. But that doesn't tell the whole story. The skewness parameter is -.766, which is highly significant. What does that mean? Two things. 1) There are a lot more teams with above average shortstops (49) than there are with below average ones (41). However, 2) the difference between the worst shortstop (.741) and the average is much greater than the difference between the best (.918) and the average. The distribution is skewed to the left. THIS IS NOT A NORMAL DISTRIBUTION; it's not even close. That makes sense, when you think about it. There is a practical limit to how good a shortstop can be. Only Superman or the Flash could get to every ground ball, and only the guy with the big red "S" could throw out every runner from deep in the hole. On the other hand, a really bad shortstop is limited only by the patience of his manager. (Incidentally, the Yankees rank either 29th or 30th all 3 years.)
The situation is a bit different at third. The average is .762, suggesting that a lot more balls get through. The st. dev. is also higher, .031, suggesting there is greater variation in ability. That makes sense too, since 3rd base is viewed by most people as being primarily an offensive position. Yet the skewness parameter is much less extreme (-.198). There are nearly as many teams below average at 3B (44) as there are above (46). Assuming a normal distribution here is probably OK.
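For anyone who wants to repeat this kind of check, here is a minimal sketch of the calculation. It uses the plain moment definition of skewness, which may differ slightly from whatever formula produced the figures above (spreadsheet functions often apply a small-sample adjustment). The function name `describe` and the sample values are hypothetical, made up only to show a left-skewed shape; they are not the 2001-03 team ZRs.

```python
import math

def describe(values):
    """Return (mean, population standard deviation, moment skewness)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    sd = math.sqrt(m2)
    skew = m3 / sd ** 3
    return mean, sd, skew

# Hypothetical team ZR values (NOT real data): one very poor team, many good ones.
sample_zr = [0.74, 0.82, 0.84, 0.85, 0.86, 0.86, 0.87, 0.88, 0.89, 0.90]

mean, sd, skew = describe(sample_zr)
above = sum(1 for v in sample_zr if v > mean)
below = sum(1 for v in sample_zr if v < mean)

print(f"mean={mean:.3f} sd={sd:.3f} skew={skew:.3f} above={above} below={below}")
```

A negative skew together with more teams above the mean than below it is the pattern described for SS: a long tail of bad fielders on the left and a crowded right side.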
Posted 10:26 p.m.,
August 20, 2003
(#7) -
Tangotiger
(homepage)
FJM: check out the above link. It lists the UZR for all players, min 120 games over 4 years. Maybe you can take that, bring up the threshold to 240 games or 300 games or something, and run your thingie again. I'd like to see the results against UZR.
I agree that there is greater variability at 1b,3b than ss,2b. I think my numbers bore that out (.022 or something for ss and .027 or something for 3b). It's reassuring that ZR showed something similar, but a bit higher (which we'd expect because ZR includes the park factor, and pitcher tendency/handedness effect, which UZR strips away).
Anyway, just eyeballing the UZR chart, things do look normal, but I agree that you would expect positions that don't tolerate bad defense to have a different skew. 3b is a neutral type of position, so we should expect no skew and wide variance. At 1b we expect the skew to be the opposite of SS.
Good stuff!
Andrew: thanks for the offer. I'm not sure what can be done. I usually work on impulse, and have a habit of leaving a lot of things unfinished.
Posted 3:06 a.m.,
August 21, 2003
(#8) -
FJM
I'll give it a try. I wouldn't expect too much, though. The wide variation in number of games played will make the observed standard deviation questionable, unless the selection criterion is set awfully high. And if I do that, I won't have enough observations to work with. Is there any way to get annual Team UZR's? I'm also unclear how UZR Runs/162 translates to UZR percentage.
Posted 7:49 a.m.,
August 21, 2003
(#9) -
Tangotiger(e-mail)
I can send you the annual Team UZRs. Send me an email.
To translate runs into a rate stat, you divide by the number of plays per year at that position. For example, I think I set 1B at 2x162 and 3B at 4x162.
(Actually, I kind of fudge a little: if a SS makes 3 of the 21 outs on BIP, and there were 28 BIP, I give him 4 "plays". It kinda keeps things in line, since each BIP doesn't belong to any one fielder.)
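The arithmetic in this comment can be sketched as below. This is only an illustration of the two steps as stated here, not the actual UZR code: the function names are hypothetical, and the second function follows the "divide by plays per year" description literally, so its result is runs per play. Whether a further conversion is applied afterwards is not stated in the thread.

```python
def estimated_plays(fielder_outs, team_outs_on_bip, team_bip):
    """Scale a fielder's share of the outs on BIP up to a share of all BIP."""
    if team_outs_on_bip <= 0:
        raise ValueError("team_outs_on_bip must be positive")
    return fielder_outs * team_bip / team_outs_on_bip

def runs_to_rate(runs_per_162, plays_per_game):
    """Divide runs per 162 games by the plays per year at the position."""
    return runs_per_162 / (plays_per_game * 162)

# The example from the comment: 3 of 21 outs, 28 balls in play.
plays = estimated_plays(3, 21, 28)

# Hypothetical fielder: +10 runs per 162 games at 3B (4 plays per game).
rate_3b = runs_to_rate(10, 4)

print(plays, rate_3b)
```

With the numbers from the comment, the shortstop's 3 outs scale up to 4 plays, matching the figure quoted above.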
Posted 10:43 a.m.,
August 21, 2003
(#10) -
studes
(homepage)
Tango, thanks for the summary. I'm looking forward to the summarized summary, too. I definitely can't keep up with the mathematics, though this summary helped me a lot.
Non-mathematical comment: When I read these sorts of studies, I always wonder if there's a way to include ALL elements into the study. In other words, if you include batters by batter type, let's say, might you come to different conclusions? In particular, if batter were included, along with luck, park, pitching and fielding, what would happen to the relative results?
I'm sure this is nearly impossible to analyze with the data.
Posted 10:54 a.m.,
August 21, 2003
(#11) -
tangotiger
Hmmm... the batter. From the perspective of the pitcher, the true variance of the batter, and any random element, would be zero (I think I'm saying that correctly). Even something as substantial as the park has a stdev of .004, barely making a dent into the equation.
I don't think it's an issue in this case.
Posted 2:24 p.m.,
August 22, 2003
(#12) -
tangotiger
(homepage)
Guys,
I just wanted to thank you all for this thread again. It's been a very big eye opener for me, and I enjoyed tremendously the work that Erik and Arvin especially put in, as well as the different perspectives of everyone who posted. This may have been the only DIPS thread where it was truly a pleasure to read everyone's posts.
I don't think I will be doing a summary of this summary. If someone would like to do it, feel free to jump in.
I've been trying to "bend the wand" for about a month now, but this great DIPS work really reeled me in. And other things that I've read on other topics around (like at battersbox and Clutch) have conspired to pull me in further.
Anyway, looks like the only way for me to stop procrastinating is to go cold turkey. So, after this weekend, I won't be stopping by for a while, or reading anything else online. If someone wants me to post some links in Primate Studies, I'll be glad to do so, but I won't offer any of my thoughts on the matter. I'll be back in time for the World Series in a limited capacity.
MGL and I have talked about maybe starting a site to preview our research, so maybe we'll have something worked out by then. You can join the group at the "homepage" link above to be on the mailing list.
Thanks again guys.... truly fun to talk with all of you.
Tom
Posted 4:53 p.m.,
August 22, 2003
(#13) -
Dirk
(homepage)
I wouldn't be so quick to leave the batter out of it. If you look at one of Baseball Prospectus's new reports (homepage), you'll see that the batters faced by pitchers are not at all evenly distributed. Limiting the question to pitchers with >20 starts and >100IP, the limits of the spread are
Pitcher          Team  GS  IP     PA   AVG   OBP   SLG   OPS
Zambrano_Victor  TBA   21  133.7  635  .271  .345  .445  .790
Jennings_Jason   COL   27  151.3  679  .256  .325  .399  .724
That's a difference of nearly 1 RA/9 IP using RC, so that's nothing to sneeze at. Someone else will have to do this for what we're actually interested in (Batter's BABIP) rather than the standard stats.
Of course, once you start adjusting pitchers for batters faced, you wonder if you should adjust the batters for pitchers faced...I don't know how you stop.
Posted 9:06 p.m.,
August 22, 2003
(#14) -
FJM
Dirk: I'm guessing that Woolner simply used each batter's overall stats to compute these averages. But if you're trying to determine how tough a batter is for a particular pitcher, that's the wrong approach. At the very least you need to consider how each batter does against LHP/RHP, whichever is appropriate. Depending on the pitcher, you might also need to look at how each one does against GB or FB pitchers and/or power/finesse pitchers. In its present form, this is pretty useless.
Posted 5:24 p.m.,
August 25, 2003
(#15) -
Robert Dudek
I think the batter needs to be included in any full assessment. Major league hitters differ greatly in their ability to get hits on balls in play. The aggregate group of hitters a pitcher faces is going to differ quite a bit over a 200-800 PA sample, I'd guess.
Posted 6:37 p.m.,
August 25, 2003
(#16) -
Tangotiger
Remember the equation:
True variance (DER) = True variance (pitching) + True variance(fielding) + True variance (park) + True variance (hitting) + True variance (fill in the blanks)
We know that the true variance is .012 for DER. My guess is that the true variance for hitting, from the perspective of the pitcher, is close to zero.
I'm pretty sure this is how we are supposed to look at it, but I'll defer to the statisticians.
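A small numeric sketch of how that equation gets used: if the components are independent, their variances add, so a component's standard deviation has to be squared before it is subtracted from the total. The function name `remaining_sd` is hypothetical. The .004 park figure is the stdev quoted earlier in this thread; treating the .012 DER figure as a standard deviation rather than a variance is an assumption made purely for illustration.

```python
import math

def remaining_sd(total_sd, *component_sds):
    """SD left after removing independent components (variances add)."""
    remaining_var = total_sd ** 2 - sum(sd ** 2 for sd in component_sds)
    if remaining_var < 0:
        raise ValueError("components exceed the total variance")
    return math.sqrt(remaining_var)

total_sd = 0.012  # ASSUMPTION: the .012 DER figure read as a standard deviation
park_sd = 0.004   # park stdev quoted earlier in the thread

without_park = remaining_sd(total_sd, park_sd)
print(f"{without_park:.4f}")
```

Under those assumptions, removing the park leaves roughly .0113 of the .012, which is why a component that small barely dents the total. A hitting component near zero would change it even less.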
Posted 6:57 p.m.,
August 25, 2003
(#17) -
Dirk
Tango -- I remember that equation. My guess, after looking at Woolner's stats, is that the variance of the hitting, as perceived by the pitcher, is not zero. But obviously that stat page doesn't prove anything, because it's about more than just balls in play.
I'm a neophyte at actually running these numbers myself rather than reading what the rest of you do. I downloaded ASS over the weekend and started playing with it, but it doesn't seem to have the horsepower to compute something like this. And I can't make sense of its raw data files easily.
If I could get a data file with rows something like this
pitcher batter pitchHand result
where result is limited to K/BB/HR/1B/2B/3B/out-in-play
I'd get the rest of the analysis going.
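As a starting point, the rest of the analysis could look something like the sketch below. It assumes whitespace-separated rows in exactly the four-column layout above. All function names and the sample rows are hypothetical, and the method is deliberately crude: each batter's BABIP is computed from the whole file (including his results against the pitcher in question), then averaged over the balls in play each pitcher allowed.

```python
from collections import defaultdict

HITS_IN_PLAY = {"1B", "2B", "3B"}
BALLS_IN_PLAY = HITS_IN_PLAY | {"out-in-play"}

def parse_rows(lines):
    """Yield (pitcher, batter, pitchHand, result) tuples; skip malformed lines."""
    for line in lines:
        parts = line.split()
        if len(parts) == 4:
            yield tuple(parts)

def batter_babip(rows):
    """Hits on balls in play divided by balls in play, per batter."""
    hits, bip = defaultdict(int), defaultdict(int)
    for _pitcher, batter, _hand, result in rows:
        if result in BALLS_IN_PLAY:
            bip[batter] += 1
            if result in HITS_IN_PLAY:
                hits[batter] += 1
    return {b: hits[b] / bip[b] for b in bip}

def opponent_babip(rows, babip):
    """Average BABIP of the batters each pitcher faced, weighted by BIP."""
    total, count = defaultdict(float), defaultdict(int)
    for pitcher, batter, _hand, result in rows:
        if result in BALLS_IN_PLAY:
            total[pitcher] += babip[batter]
            count[pitcher] += 1
    return {p: total[p] / count[p] for p in count}

# Hypothetical rows, NOT real play-by-play data.
sample = [
    "P1 B1 R 1B",
    "P1 B1 R out-in-play",
    "P1 B2 R K",
    "P2 B2 L out-in-play",
    "P2 B2 L out-in-play",
    "P2 B1 L HR",
]

rows = list(parse_rows(sample))
babip = batter_babip(rows)
quality = opponent_babip(rows, babip)
print(babip, quality)
```

The spread of the `quality` values across pitchers is the quantity in question: if it is tiny compared with the other terms in the variance equation, the batter can be ignored. Splitting by `pitchHand` would be the natural next step, as FJM suggested.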
Posted 11:01 p.m.,
December 26, 2003
(#18) -
tangotiger
This article is this week's "Oprah's Book of the Week", and required reading for anyone who missed it.
Posted 12:51 a.m.,
December 27, 2003
(#19) -
MGL
As I said on Fanhome, that is a phenomenal article. It should make your head spin!
"Anyway, looks like the only way for me to stop procrastinating is to go cold turkey. So, after this weekend, I won't be stopping by for a while, or reading anything else online. If someone wants me to post some links in Primate Studies, I'll be glad to do so, but I won't offer any of my thoughts on the matter. I'll be back in time for the World Series in a limited capacity."
I feel for you as much as anyone of course, as I periodically get addicted to Primer and Fanhome. However, how many times have you threatened to leave for a while and then come crawling back? ;)
Actually I need to do the same thing and concentrate on my real work and the book...
Posted 9:50 a.m.,
December 27, 2003
(#20) -
tangotiger
I'm drawn by the intelligence of the readers here... it's my vice. But, yes, I am once again (third time now?) wondering whether to take a break.