Correlation between Baserunning and Basestealing (December 10, 2003)
Mostly math...
--posted by TangoTiger at 11:31 AM EDT
Posted 12:00 p.m.,
December 10, 2003
(#1) -
Michael Humphreys
Tango,
This is a neat and surprising discovery that runs counter to twenty years of sabermetric wisdom. (We should think of a good name for this stat--how's about Quick-And-Dirty Baserunning Runs ("QAD-BR")). [Pronounced Quad-Bee-Are ;-)] I think where it might really have a big effect is in explaining Rickey Henderson's extraordinary career Win Advancements (based on that other guy's first attempt), which seem to imply that he was better than his Linear Weights value. As I mentioned in one of my posts to your Win Advancements thread, James got surprisingly high values for Rickey using sims--maybe as high or higher than for Mays. (And Mays would probably be someone else for whom the new .3*(SB-CS) metric might be revealing. His SB% was good, but not outstanding. But legend has it that he was an amazing baserunner.)
QAD-BR might also explain why when regressions are run on team offensive data, CS don't show up as bad as they are in isolation. CS "carry" information about non-stolen base baserunning effectiveness.
I suppose more analysis would be a good idea to determine whether the regression result you obtained works at the extremes, such as Rickey. Or for notoriously "over" aggressive basestealers for whom we have Super-Linear Weights Baserunning Runs.
And I suppose the caveat to keep in mind is that, based on Win Advancements, basestealing and baserunning have, with the rarest exceptions, little practical effect. But the exceptions could be interesting.
Another thought--to refine the estimate for QAD-BR, might we consider triples?
Posted 12:14 p.m.,
December 10, 2003
(#2) -
tangotiger
Michael,
re: Rickey. I did a (perhaps my most rewarding) study on the effects of batting order in a no longer web available thread at fanhome called "Linear Weights by the 24 base out states". And, in there, I made the case that Rickey's optimal talent for the one batting spot where optimization is the most critical (leadoff spot) could add 10 runs per year above what his neutral-batting-order LWTS would suggest. It is a huge, huge number. An extra 15 wins in a career potentially. (I am in the middle of redoing the study, this time also adding Markov, as well as empirical pbp data for the eventual book.) Baserunning may also add something to Rickey (probably an extra 5 runs a year as well). Rickey may have been the best nonpitcher post-Mays and pre-Bonds.
The Ed Oswalt "win probability added" I don't think included baserunning, giving all the credit to the batter.
And yes, that equation was strictly done against the 2000-2002 data, and therefore would not necessarily apply to the Raines/Coleman/Rickey era. Presumably, the fast runners of today were as fast then, and therefore, we shouldn't expect more than +/- 5 runs added on baserunning back then. However, their SB totals were double those of today. Cutting the BR best-fit in 2, and you'd get something like BS+BR = .25*SB - .40*CS. We just have to be careful what we do.
Finally, I have a high degree of correlation between triples/(2b+3b) and sb/timeOn1B. I'm sure adding triples would increase the correlation, but I doubt it would add much more. But, if we have it, we should use it.
Posted 12:21 p.m.,
December 10, 2003
(#3) -
tangotiger
What the heck... I included the triples rate, and the r went up to .65. The best-fit was .08 * SB + .07 * CS + .04 * Triples - .01*timeOn1B. The r of just triples to BR was .53.
Posted 12:55 p.m.,
December 10, 2003
(#4) -
dlf
Presumably, the fast runners of today were as fast then, and therefore, we shouldn't expect more than +/- 5 runs added on baserunning back then.
I can think of several reasons why this MAY not be true:
1. Artificial turf. All other things being equal, it is easier to go from 1st to 3rd on a single if running on the plastic. There has been a dramatic decrease in the number of turf parks since the Rickey / Raines / Coleman heyday. I would guess that the change has decreased both attempts and success rates. On the defensive side, a ball right at a fielder gets to him quicker on plastic, decreasing attempts and rates, but a ball to either side is more likely to get by increasing those attempts and rates. I have no earthly idea how to figure the change from that.
2. With fewer homers and lower overall scores, there was a perception that the benefit of advancing the extra base was higher and the risk of being thrown out lower. I would posit that runners were making the attempt more often but getting thrown out a higher percentage of times. Depending on the exact changes in attempts and success ratio, that could either push up or push down the +/- 5 runs estimate.
3. It seems that there is a greater emphasis on offensive skills versus defensive ones now than during the era from Maury Wills and Lou Brock through Omar Moreno and Ron LeFlore to Vince Coleman and Kenny Lofton. I can't prove this, but I think outfield arms have declined in accuracy and baseball-relevant strength. I can't think of any current player who would fit nicely with Dewey Evans, Jesse Barfield, Dave Parker or others. (And yes, I see the irony of having listed noodle-armed Brock & Moreno as playing in an era with better OF arms.)
I guess I don't really see a way of realistically taking the +/-5 from 2000-02 into the 1970s and 1980s let alone 1890s through 1910s.
Posted 1:05 p.m.,
December 10, 2003
(#5) -
Michael Humphreys
Tango,
I used to think Mike Schmidt was the best post-Mays, pre-Bonds non-pitcher, but I'm starting to come around to Rickey. Aside from his outstanding OBP and record-shattering baserunning, he was also a very, very good fielder, based on his DRA ratings. I only have ratings for the nine seasons in which he played 130 or more games at one outfield position, but even just looking at those seasons, he saved over 100 runs in the field. Basically, Rickey was top-flight-to-outstanding at everything except power, and was above-average in power as well. When you start adding it all up, he's been an incredible machine for winning games--particularly close games. If you're ahead by only a run in the ninth inning, you want a Rickey in the outfield; if you're behind by only a run in the ninth inning, I can't think of anybody else you'd want coming to the plate.
So adding the triples only increased the r from .63 to .65. Good thing to know--might as well keep things simple.
Not sure I understood the following: "Cutting the BR best-fit in 2, and you'd get something like BS+BR = .25*SB - .40*CS. We just have to be careful what we do." The being careful part I understand--what about the -.40 cost of SB for the 1980s?
Posted 1:36 p.m.,
December 10, 2003
(#6) -
J Cross
Zips 2004 projected QAD-BR leaderboard:
Name............Team....SB......CS......QAD-BR
Carl Crawford*..TB......50......10......12.0
Juan Pierre*....FLA.....55......19......10.8
Carlos Beltran#.KC......35.......5......9.0
Alfonso Soriano.NYA.....37......12......7.5
Jose Reyes#.....NYN.....37......12......7.5
Alex Sanchez*...DET.....45......22......6.9
Dave Roberts*...LA......35......12......6.9
Ichiro Suzuki*..SEA.....34......13......6.3
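For anyone who wants to reproduce the QAD-BR column, here is a minimal Python sketch. It assumes the .3*(SB-CS) form mentioned in post #1 and copies the SB/CS projections from the table above; the function and variable names are just for illustration.

```python
# QAD-BR as discussed in this thread: 0.3 runs per net steal (SB - CS).
# Rows are copied from the leaderboard above: (name, SB, CS).
PROJECTIONS = [
    ("Carl Crawford", 50, 10),
    ("Juan Pierre", 55, 19),
    ("Carlos Beltran", 35, 5),
    ("Alfonso Soriano", 37, 12),
    ("Jose Reyes", 37, 12),
    ("Alex Sanchez", 45, 22),
    ("Dave Roberts", 35, 12),
    ("Ichiro Suzuki", 34, 13),
]


def qad_br(sb, cs, weight=0.3):
    """Quick-and-dirty running runs: weight * (SB - CS)."""
    return weight * (sb - cs)


# Sort by QAD-BR, best first (ties keep their original order).
leaderboard = sorted(
    ((name, round(qad_br(sb, cs), 1)) for name, sb, cs in PROJECTIONS),
    key=lambda row: row[1],
    reverse=True,
)

for name, runs in leaderboard:
    print(f"{name:<16}{runs:>5.1f}")
```

The weight is a parameter so that a different constant (say .25 for another era) can be swapped in without touching anything else.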
Posted 2:12 p.m.,
December 10, 2003
(#7) -
tangotiger
(homepage)
You are right that the LWTS for SB would be different in the 80s. I don't remember off the top of my head what it would be, but being about .05 runs below the 99-02 time period is about right. The SB value is pretty constant across the 3-6 RPG environment.
I would guess that if BR = .10SB + .12CS in 99-02, then it might be = .05SB + .06 CS in the 80s. So, .20+.05 for SB and -.41+.12 for CS.
So, SB-CS seems to work pretty well. It's just the constant (.25 or .30 or whatever) that needs to change.
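The arithmetic behind "SB-CS seems to work" can be sketched in a few lines of Python. This assumes the .20 and -.41 in the post are the linear-weights values of the steal and the caught stealing themselves, and uses the 99-02 baserunning fit BR = .10SB + .12CS quoted above; the dictionary and function names are made up for the example.

```python
# Per-event run weights as read from the post (1999-2002 period).
# STEALING: assumed linear-weights value of the SB / CS event itself.
# RUNNING: the best-fit for non-steal baserunning runs (BR = .10*SB + .12*CS).
STEALING = {"SB": 0.20, "CS": -0.41}
RUNNING = {"SB": 0.10, "CS": 0.12}


def combine(*weight_sets):
    """Add per-event weights from several linear formulas."""
    total = {}
    for weights in weight_sets:
        for event, value in weights.items():
            total[event] = total.get(event, 0.0) + value
    return total


combined = combine(STEALING, RUNNING)
# Rounded for display; the raw sums carry floating-point noise.
print({event: round(value, 2) for event, value in combined.items()})
```

Under those assumptions the sum comes to roughly .30 per SB and -.29 per CS, which is very nearly .3*(SB-CS).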
**********
I agree, there may have been more fast players in the 80s.
As for changes in going from 1b to 3b, you can check out John Jarvis' site. It's a great resource. (See homepage link above.)
You can also check out what I've got here:
http://www.geocities.com/tmasc/destmob.html
That's from 1978-1990. I would guess if you look at Jarvis' data for 1999-2002, you'll probably get similar numbers.
Posted 2:13 p.m.,
December 10, 2003
(#8) -
studes
(homepage)
I may be wrong, but it seems to me that you have a huge multicollinearity issue when you include both SB and CS in your formulas (that is, the correlation between SB and CS is huge). I would think that it undermines the equations, though I don't know how much.
Posted 2:37 p.m.,
December 10, 2003
(#9) -
tangotiger
I think I mentioned the high correlation between SB, CS and triples.
If you want a best-fit with only SB, and only CS, and only triples:
BR = [.11 3b/(3b+2b) - .01] * timesOn1B
BR = .13 * SB - .01*timesOn1B
BR = .36 * CS - .01*timesOn1B
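A small Python sketch of the three single-variable fits above, for anyone who wants to plug in a player. The coefficients are taken from the post as written; the function names and the sample inputs are hypothetical.

```python
def br_from_triples(doubles, triples, times_on_1b):
    """BR = [.11 * 3b/(3b+2b) - .01] * timesOn1B."""
    extra_base_hits = doubles + triples
    # Guard against a player with no doubles or triples.
    rate = triples / extra_base_hits if extra_base_hits else 0.0
    return (0.11 * rate - 0.01) * times_on_1b


def br_from_sb(sb, times_on_1b):
    """BR = .13 * SB - .01 * timesOn1B."""
    return 0.13 * sb - 0.01 * times_on_1b


def br_from_cs(cs, times_on_1b):
    """BR = .36 * CS - .01 * timesOn1B."""
    return 0.36 * cs - 0.01 * times_on_1b


# Hypothetical player: 30 SB, 10 CS, 30 doubles, 6 triples, 200 times on first.
print(round(br_from_triples(30, 6, 200), 2))
print(round(br_from_sb(30, 200), 2))
print(round(br_from_cs(10, 200), 2))
```

Each fit is meant to be used on its own; adding the three together would count the same speed signal three times.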
Posted 3:38 p.m.,
December 10, 2003
(#10) -
studes
(homepage)
Yah, I know you mentioned the correlation; I was just pointing out some of its implications.
Anyway, merging two of the formulas in this way is a bit more interesting, at least to me:
BS+BR = 0.20 * SB - 0.09 * CS - .01 * timeson1B
Harder to calculate, but it more dramatically shows the implicit "positive" correlation of CS on overall baserunning lwts, and also maintains a rate factor.
Great stuff, Tango. Extremely insightful.
Posted 7:55 p.m.,
December 10, 2003
(#11) -
MGL
Since a lot of smart guys hang out here but not on Fanhome, where this idea was started and continues, maybe someone can help me out here:
I understand how CS can correlate positively with baserunning runs (BRLWTS). But that is only because CS correlates with SB attempts and with SB themselves. It is like K's and offensive production. They correlate only because players with high K's also have high HR's. There is no cause/effect relationship.
Given that, how can we use the formula BR = .10SB + .12CS for any individual player?? Clearly, the higher the CS's for any given SB, the BRLWTS do not go up! If we have a regression equation that only includes CS, then yes, we can use it to predict or estimate BRLWTS such that the higher the CS, the "faster" the runner. But surely once we use SB already, the CS have to negatively correlate with speed or baserunning! So the formula should read something like BR=.something*SB minus(not plus).something * CS.
The examples I gave on Fanhome were:
player A has 100 SB 20 CS
player B 100/40
player C 100/60
Tango's formula says that player C is the "fastest" (has the highest BRLWTS). That is absurd!
When you combine the basestealing formula with Tango's baserunning formula, you get the correct sign (correlation) for CS, but that is an accident. What if CS were not that bad, such that the correct basestealing formula were SB/CS runs=.20*SB-.10*CS? Now if we add up the two formulas to get QADBR, we have the "wrong sign" for the CS (we get QADBR=.30SB+.02CS).
Someone help me out here!
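To put numbers on the three example players, here is a short Python sketch. It uses the baserunning-only fit BR = .10SB + .12CS from this post and, for the combined figure, the thread's .3*(SB-CS); the names are illustrative only.

```python
# MGL's three example players: (SB, CS).
PLAYERS = {"A": (100, 20), "B": (100, 40), "C": (100, 60)}


def baserunning_only(sb, cs):
    """Non-steal baserunning fit quoted in the post: .10*SB + .12*CS."""
    return 0.10 * sb + 0.12 * cs


def combined_running(sb, cs):
    """Stealing plus baserunning, using the thread's .3 * (SB - CS)."""
    return 0.3 * (sb - cs)


br = {name: baserunning_only(sb, cs) for name, (sb, cs) in PLAYERS.items()}
total = {name: combined_running(sb, cs) for name, (sb, cs) in PLAYERS.items()}

for name in PLAYERS:
    print(name, round(br[name], 1), round(total[name], 1))
```

The baserunning-only fit ranks C above B above A, while the combined number ranks them the other way around, which is exactly the tension the post is asking about.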
Posted 7:57 p.m.,
December 10, 2003
(#12) -
FJM
Another way around the multicollinearity problem.
Define 2 new variables: SB1=SB+CS and either SB2=SB-CS or SB3=SB-2*CS.
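The change of variables can be sketched in Python as follows. This only shows the algebra for SB1=SB+CS and SB2=SB-CS applied to the existing fit BR = .10SB + .12CS; it is not a re-run of the regression, and the function names are made up for the example.

```python
def reparametrize(a, b):
    """Rewrite a*SB + b*CS as c*(SB + CS) + d*(SB - CS)."""
    c = (a + b) / 2  # weight on attempts, SB1 = SB + CS
    d = (a - b) / 2  # weight on net steals, SB2 = SB - CS
    return c, d


# Coefficients of the baserunning fit quoted earlier in the thread.
A_SB, B_CS = 0.10, 0.12
c, d = reparametrize(A_SB, B_CS)


def br_original(sb, cs):
    return A_SB * sb + B_CS * cs


def br_reparametrized(sb, cs):
    return c * (sb + cs) + d * (sb - cs)


print(round(c, 3), round(d, 3))
print(round(br_original(100, 40), 2), round(br_reparametrized(100, 40), 2))
```

Written this way, nearly all of the weight lands on attempts (about .11) and almost none on net steals (about -.01), so the fit is mostly saying that players who run a lot are good baserunners.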
Posted 8:03 p.m.,
December 10, 2003
(#13) -
MGL
Now that I think of it some more....
Assuming that Tango's multiple linear regression of SB and CS on baserunning lwts was correct (I assume he did a "normal" MLR analysis), is it possible that it is true that even for a constant SB, that players who have more CS's are actually faster (have higher baserunning lwts)? IOW, what is more predictive of speed or baserunning lwts is a player's attempts and not their success rate?
I don't know that much about regression analyses, but when you do a multiple regression, don't the regression coefficients "assume" that the other variable is constant?
Is that where I am going wrong? I just assumed that a player with 100 SB and 20 CS would be faster than a player who had 100/40, even though the 100/40 attempted a steal more often?
Posted 8:11 p.m.,
December 10, 2003
(#14) -
MGL
BTW, as far as the overall value of Rickey, as compared to Schmidt or any other player, that is why Super-lwts (whether we use UZR, DRA, or any other good defensive metric doesn't matter) is so important and valuable, if I may say so myself. Without baserunning lwts and defense, and a few other minor things (GDP and moving runners over), we are leaving out a significant part of the picture for no good or even apparent reason.
If we add in Tango's custom lwts by batting order, we have almost everything we need to see who are the best and worst overall players in any era or across eras (assuming we do cross-era adjustments correctly)...
Posted 8:32 p.m.,
December 10, 2003
(#15) -
David Smyth
---"I just assumed that a player with 100 SB and 20 CS would be faster than a player who had 100/40, even though the 100/40 attempted a steal more often?"
That is not a safe assumption to make.
Posted 9:19 p.m.,
December 10, 2003
(#16) -
Ted T,
These results are all what theory predicts... cf. the theory paper I wrote which was primated (or I guess it was a clutch hit back then, this was March or so) in the spring. SB and CS *should* be highly correlated, because success percentage should (as it does) have a narrow distribution in the population, but attempt percentage should (as it does) have a skewed distribution in the population. So CS *should* be very informative about baserunning ability.
The other paper of mine which tango primated early November makes essentially the same point as the post which started this thread.
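The argument above can be illustrated with a toy simulation in Python. Everything in it is an assumption chosen for illustration rather than anything from the papers: attempts are drawn from an exponential distribution (skewed, mean 20), success rates from a narrow uniform range (65% to 80%), and the seed is fixed so the run is repeatable.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


rng = random.Random(0)  # fixed seed, purely for repeatability
sb, cs = [], []
for _ in range(2000):
    attempts = rng.expovariate(1 / 20)  # skewed attempt totals (assumed)
    rate = rng.uniform(0.65, 0.80)      # narrow success-rate band (assumed)
    sb.append(attempts * rate)
    cs.append(attempts * (1 - rate))

r_sb_cs = pearson(sb, cs)
print(round(r_sb_cs, 2))
```

With success rates squeezed into a narrow band, both SB and CS are driven almost entirely by how often the runner goes, so the two end up strongly correlated even though one is a success and the other a failure.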
Posted 11:34 p.m.,
December 10, 2003
(#17) -
J Cross
"I just assumed that a player with 100 SB and 20 CS would be faster than a player who had 100/40, even though the 100/40 attempted a steal more often?"
Or maybe, and this may be a stretch, Base Runs are a function of aggressiveness as well as speed. The guy who gets caught stealing 40 times is obviously willing to take some chances.
Posted 11:40 p.m.,
December 10, 2003
(#18) -
MGL
Or maybe, and this may be a stretch, Base Runs are a function of aggressiveness as well as speed. The guy who gets caught stealing 40 times is obviously willing to take some chances.
Actually, that may very well be true. We think that baserunners on the average are WAY too conservative. We will address that in our book. Over-aggressiveness on basestealing is a bad thing, but over-aggressiveness on baserunning may be a very good thing, so your theory may have some merit....
Posted 4:21 a.m.,
December 11, 2003
(#19) -
Dackle
Incredible stuff Tango, arguably the best I've seen here or on Fanhome in months. Michael Humphreys, nice to tie it together into an essential number -- QAD-BR, a sweet stat desirable to see on a regular basis next year and beyond. You guys rock.
Posted 10:06 a.m.,
December 11, 2003
(#20) -
tangotiger
I posted one last study at the fanhome thread. Go to the bottom and look for around today's date/time.
Posted 12:31 p.m.,
December 11, 2003
(#21) -
Michael Humphreys
Tango,
Your last comment at your last post at fanhome was interesting. Paraphrasing a bit, when you look at the players with the highest gross number of SB+CS+3B, Super-LWTS Baserunning Runs drop just slightly if the SB success rate is average or below average. So there might be a slightly non-linear effect. Wonder whether there is a simple transformation of the data that would improve the fit.
Maybe the rule of thumb to use is take QAD-BR at face value if the SB success rate is at least average or slightly better; otherwise discount it ever so slightly.
Also, and I'm sure you've done this, just lost track of it, QAD-BR projects total base*stealing* and base*running* runs in one number?
Posted 12:47 p.m.,
December 11, 2003
(#22) -
tangotiger
(homepage)
Yup, running = basestealing + baserunning
*********
If anyone wants, I uploaded the data so that you can play with it as well (see homepage link).
Posted 1:02 p.m.,
December 11, 2003
(#23) -
MGL
Regarding the data above..
OK, the plus guys are at the top and the minus guys at the bottom, but I can see no particular "order" for the list other than that. Is this part of an IQ test? In what order are the players in the list?
Posted 1:53 p.m.,
December 11, 2003
(#24) -
tangotiger
No order. Just data provided for those who want to do their own research. I think I've exhausted what I can do.
Posted 2:14 p.m.,
December 11, 2003
(#25) -
tangotiger
Ok, I added more columns of data (including the best-fit, as well as the difference between the best-fit and the actual baserunning LWTS, where you see that Roger Cedeno and Vlad are much worse baserunners than their speed says they should be).
And, I ordered the data by best-fit.
Posted 2:10 a.m.,
December 12, 2003
(#26) -
MGL
I thought that the best/worst column would be cool, but it looks like most of the "worst" are just players who rarely attempt steals and are really slow, as opposed to players who rarely attempt steals, but are not that slow. Also, I wonder how much randomness there is in the best/worst ratings (I think a lot) as getting thrown out at a few bases or not can be a fluke one way or another.
I could have sworn there was some "order" to that first list, as it looked like most of the fast guys were at the top and the slow guys were at the bottom.
Wouldn't it be great if players could both steal and run the bases optimally, especially the fast ones (i.e., be less aggressive at the right times on basestealing and more aggressive on baserunning)? I guess we'll have to wait until our book comes out! :)
Posted 11:50 a.m.,
December 12, 2003
(#27) -
J Cross
What book? Can we get a preview?