Tango on Baseball Archives

© Tangotiger

Injury-prone players (October 14, 2003)

I was floored when I read this, and I hope Steve drops by to present the premise and results.
--posted by TangoTiger at 09:08 AM EDT


Posted 12:00 p.m., October 14, 2003 (#1) - Steve Treder(e-mail)
  Happy to do it, Tom.

This whole thing started with a thread about Jeffrey Hammonds during the week of February 16th ... I've been unable to unearth it in the archives recently (maybe someone smarter than me can figure out how to do it). Anyway the discussion centered around the question of whether injuries to players are more or less random, unpredictable. I asserted that they aren't random at all, that some players are clearly more injury-prone than others.

Walt Davis weighed in, as usual, with fairness and wisdom. He said that simply pointing out injury-prone versus durable careers from the past proved nothing, that if I were to test my thesis, I should predict in advance which players would likely get hurt and which would remain healthy.

So, I consulted my 2003 Baseball Register, and just compiled two lists: the first a list of 50 players who appeared to have not gotten hurt much in their careers, and the second a list of 50 players who did get hurt a lot. I excluded both pitchers and catchers from the lists, since both of those positions are obviously prone to causing injuries. Here are the two lists:

Non-injury-prone players:

Bobby Abreu
Garret Anderson
Marlon Anderson
Jeff Bagwell
Pat Burrell
Mike Cameron
Johnny Damon
Ray Durham
David Eckstein
Brad Fullmer
Jason Giambi
Luis Gonzalez
Troy Glaus
Shawn Green
Ben Grieve
Marquis Grissom
Vladimir Guerrero
Todd Helton
Andruw Jones
Chipper Jones
Jacque Jones
Ryan Klesko
Paul Konerko
Mark Kotsay
Carlos Lee
Derrek Lee
Terrence Long
Tino Martinez
Fred McGriff
Doug Mientkiewicz
John Olerud
Magglio Ordonez
Rafael Palmeiro
Neifi Perez
Juan Pierre
Albert Pujols
Jimmy Rollins
Richie Sexson
Randall Simon
Chris Singleton
Sammy Sosa
Miguel Tejada
Jim Thome
Michael Tucker
Todd Walker
Daryle Ward
Craig Wilson
Jack Wilson
Randy Winn
Todd Zeile

Injury-prone players:

Edgardo Alfonzo
Moises Alou
Rich Aurilia
Adrian Beltre
Ellis Burks
Sean Casey
Roger Cedeno
Greg Colbrunn
Jose Cruz Jr.
David Dellucci
Erubiel Durazo
Jermaine Dye
Jim Edmonds
Carl Everett
Cliff Floyd
Nomar Garciaparra
Jeremy Giambi
Alex S. Gonzalez
Juan Gonzalez
Rusty Greer
Ken Griffey Jr.
Carlos Guillen
Ricky Gutierrez
Jeffrey Hammonds
Bobby Higginson
Todd Hollandsworth
Geoff Jenkins
Nick Johnson
Brian Jordan
Barry Larkin
Edgar Martinez
Bill Mueller
Phil Nevin
Jay Payton
Aramis Ramirez
Manny Ramirez
Pokey Reese
Tim Salmon
Reggie Sanders
David Segui
Gary Sheffield
Junior Spivey
Shannon Stewart
Fernando Tatis
Mo Vaughn
Fernando Vina
Larry Walker
Rondell White
Matt Williams
Dmitri Young

The first list of players is slightly younger than the second, with an average birth year of 1971.6, while the second list has an average birth year of 1969.7. I would consider this to be relevant, but not such a huge difference as to explain any differences in injury rates by itself.

I kept everyone updated throughout the season on the days spent on the DL for players from both lists; if anyone can access the archived thread, they can see the monthly updates. The season-ending tally is as follows:

11 of the 50 players from the non-injury-prone list went on the DL during the 2003 regular season, spending 503 total days on the DL. 27 of the 50 players from the injury-prone list went on the DL during the 2003 regular season, spending 2,190 total days on the DL, or 81.3% of the grand total of DL days for the 100 players.
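As a quick check of the tally above, the arithmetic can be sketched in Python. The two-proportion z-test at the end is not something from the thread; it is an added assumption, using a normal approximation, just to give a rough sense of how unlikely an 11-versus-27 split would be if both lists had the same underlying DL rate.

```python
import math

# Season-ending tallies as reported in the post above.
n_per_list = 50
healthy_on_dl, healthy_days = 11, 503
prone_on_dl, prone_days = 27, 2190

# Share of all DL days that came from the injury-prone list.
prone_share = prone_days / (healthy_days + prone_days)

# Average DL days per player on each list.
healthy_avg = healthy_days / n_per_list
prone_avg = prone_days / n_per_list

# Two-proportion z-test (an added assumption, not part of the original study).
p1 = healthy_on_dl / n_per_list
p2 = prone_on_dl / n_per_list
pooled = (healthy_on_dl + prone_on_dl) / (2 * n_per_list)
se = math.sqrt(pooled * (1 - pooled) * (2 / n_per_list))
z = (p2 - p1) / se
p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided, normal approximation

print(f"injury-prone share of DL days: {prone_share:.1%}")
print(f"avg DL days per player: {healthy_avg:.1f} vs {prone_avg:.1f}")
print(f"z = {z:.2f}, two-sided p = {p_value:.4f}")
```

The share reproduces the 81.3% figure quoted above. The z-test treats the two lists as independent random samples, which they are not (they were hand-picked), so it is only a rough gauge.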

While I expected the injury-prone players to get hurt more often, this result was far more dramatic than I anticipated. In our discussions, I said that if the difference between the two lists was within 10% or so, it should be considered inconclusive; obviously the difference was vastly more than 10%.

While obviously not the most exhaustive study possible, I certainly consider this quick-and-dirty little exercise to be a strong indicator that injuries are not at all a purely random event, and that a player's injury history is a very good indicator of his future injury likelihood.

Walt also owes me a beer! :-)

Anyone who wants the full details of each player's stints on the DL is free to email me, and I'll send you my spreadsheet.

Posted 12:28 p.m., October 14, 2003 (#2) - tangotiger (homepage)
  The above link is probably what Steve was originally referring to.

You can google the following
hammonds treder davis site:baseballprimer.com

Posted 12:39 p.m., October 14, 2003 (#3) - tangotiger(e-mail)
  Steve, a couple of things:

1 - I'm not sure how I missed this thread, but I'm glad I saw it finally.

2 - I see that you have some players who were on the DL as of opening day. I also see that you selected the players over 3 weeks prior to opening day. However, were any of those players on the DL in the last week of 2002, the prior season? That is, they were on the DL until Oct 1, 2002, and starting from March 31, 2003. In fact, they could have been injured the whole time, and the way you selected the players, you might have grabbed a few like that. It certainly won't explain the whole difference though.

3 - Age should have been a controlled variable as you alluded to. Perhaps a breakdown of the 50 players by: born 1977 or later, born 1970-1976, born 1969 or earlier (or whatever boundaries you want to set so that the 100 players fall more or less as 33/33/33).

4 - As well, rather than just DL days, look at just "players on DL", and maybe a breakdown by "15 days or less", "16-59 days", "60 days or more".

At this point, the level of granularity might not leave us with much.

I think you already did great work on this. If you've had enough with this, send me your spreadsheet, and I wouldn't mind taking a look at this.

And for posterity, I think an "official" writeup of your findings is called for, and I'd be happy to post it here, or send it to the home page for publication.

Great job!

Posted 12:47 p.m., October 14, 2003 (#4) - Steve Treder
  "However, were any of those players on the DL in the last week of 2002, the prior season?"

Not to my knowledge. I tried to only pick players who were actively playing during spring training while I picked them, around March 1st.

I'll email you the spreadsheet, Tom, and you can go to town on it. I agree that controlling for age would be an important step; I misspoke earlier: the average birth year for the non-injury-prone players is 1972.9, and for the injury-prone players is 1971.0.

Posted 1:39 p.m., October 14, 2003 (#5) - tangotiger
  Thanks to Steve for providing the data.

I split his data into
Old: born 1970 or earlier (30 players)
Young: born 1975 or later (27 players)
MiddleAged: born 1971-1974 (43 players)

Among the old players, 10 were non-injury prone, and 20 were. The average DL time per class was: 9 days for non-injury prone, and 65 days for injury prone.

Among the young, 21 were non-injury prone, and 6 were. 18 DL days for the healthy ones, and 40 for the unhealthy. Though, at 6 data points...

Among the middle aged folks, 19 non-injury prone, and 24 injury-prone. TWO DL days for the healthy, and 27 for the unhealthy.
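The age breakdown above can be sanity-checked against Steve's season totals. This is a small sketch that multiplies each group's size by its posted average. Because the averages are rounded to whole days, the reconstructed totals should land near, not exactly on, the reported 503 and 2,190.

```python
# (label, n_healthy, avg_days_healthy, n_prone, avg_days_prone), as posted above.
age_groups = [
    ("old (born 1970 or earlier)", 10, 9, 20, 65),
    ("young (born 1975 or later)", 21, 18, 6, 40),
    ("middle (born 1971-1974)", 19, 2, 24, 27),
]

# Rebuild approximate total DL days for each list from the rounded averages.
healthy_total = sum(n_h * avg_h for _, n_h, avg_h, _, _ in age_groups)
prone_total = sum(n_p * avg_p for _, _, _, n_p, avg_p in age_groups)

# Within every age group the injury-prone players averaged more DL days.
for label, n_h, avg_h, n_p, avg_p in age_groups:
    print(f"{label}: gap of {avg_p - avg_h} DL days per player")

print(healthy_total, prone_total)
```

The point of the breakdown is that the gap shows up inside each age band, so the age difference between the two lists cannot account for it on its own.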

I think this is a fascinating idea, and if you want to improve on this study, I would suggest the following:

Do a matched-pair study. That is, you have 2 groups that are equals in terms of:
- age
- position
- body type
- performance level

but differ in the number of times and number of days on the DL over the last 4 years.

I would also say that you would want all players to NOT have been on the DL in 2003, so that there are no "lingering" effects.

Removing catchers and pitchers is also the right idea.

So, you find, say, Alfonso Soriano's twin (born within 1 year, plays 2B, BMI within 1, above average hitter), but someone who has gone on the DL in 2000-2002.
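The twin-finding rule just described might be sketched like this. The `Player` record, its field names, and the sample players are all hypothetical, invented for illustration; the thresholds (born within 1 year, same position, BMI within 1, same hitting class, opposite DL history) are the ones from the post.

```python
from dataclasses import dataclass


@dataclass
class Player:
    name: str
    birth_year: int
    position: str
    bmi: float
    above_avg_hitter: bool
    dl_days_2000_2002: int


def find_twins(target, pool):
    """Return players who match the target on everything except DL history."""
    return [
        p for p in pool
        if p.name != target.name
        and abs(p.birth_year - target.birth_year) <= 1
        and p.position == target.position
        and abs(p.bmi - target.bmi) <= 1.0
        and p.above_avg_hitter == target.above_avg_hitter
        # Exactly one of the pair should have DL time in 2000-2002.
        and (p.dl_days_2000_2002 > 0) != (target.dl_days_2000_2002 > 0)
    ]


# Placeholder records, invented purely for illustration (not real players).
healthy = Player("Player A", 1976, "2B", 24.5, True, 0)
pool = [
    Player("Player B", 1977, "2B", 25.0, True, 45),  # matches, has DL history
    Player("Player C", 1976, "SS", 24.5, True, 30),  # wrong position
    Player("Player D", 1972, "2B", 24.0, True, 60),  # born too far apart
    Player("Player E", 1976, "2B", 24.8, True, 0),   # no DL history either
]
twins = find_twins(healthy, pool)
print([p.name for p in twins])
```

With real rosters, many players would end up with no twin at all, which is the main practical cost of this kind of design.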

You could potentially do this for past years, as long as you don't let your future knowledge make you select players.

Fascinating stuff Steve!

Posted 2:49 p.m., October 14, 2003 (#6) - tangotiger
  Just running some more stuff on Steve's data.

I get an r of almost .40 between age/proneness to days on DL.

Days on DL = x + y, where
x = 31 if injury prone
y = 1.3 * (Age - 23)

So, a non-injury prone 23 year old would be expected to be on the DL 0 days, while an injury-prone 36 year old would be expected to be on the DL 48 days.
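For anyone who wants to plug in other ages, the fitted line above can be written as a tiny function. The function name is made up here; the coefficients 31 and 1.3 and the age offset of 23 are the ones posted.

```python
def expected_dl_days(age, injury_prone):
    """Rough linear fit from the post: Days on DL = x + y."""
    x = 31 if injury_prone else 0   # flat bump for the injury-prone list
    y = 1.3 * (age - 23)            # about 1.3 extra days per year past 23
    return x + y


print(expected_dl_days(23, False))        # non-injury-prone 23-year-old
print(round(expected_dl_days(36, True)))  # injury-prone 36-year-old
```

Note that the line goes negative for a non-injury-prone player younger than 23, so it is best read as a rough summary of this sample and not as a projection tool.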

Posted 3:36 p.m., October 14, 2003 (#7) - dlf
  Do a matched-pair study. That is, you have 2 groups that are equals in terms of:
- age
- position
- body type
- performance level

I suspect that adding "performance level" would introduce serious flaws into the study. Even using rate stats rather than cumulative performance, I would suspect that of any two players with identical "true talent" the player who has historically been injured more would be less likely to actually perform as well.

Posted 4:04 p.m., October 14, 2003 (#8) - tangotiger
  dlf,

It's an interesting thought.

So, you are saying that if you look at players from 2000 to 2003, and you've got 2 players with a .280/.350/.470 line, but:

- player 1 did that with 1500 PA over the 4 years, though never on the DL in 2003
- player 2 did that with 2500 PA over the 4 years, and never on DL

that.... What would happen in 2004?

We're not really tracking what the performance level of the player will be in 2004, but rather what his DL status will be.

By taking two players that are similar in performance level, body type and position, we've got "twins". If they didn't have the same performance level, then maybe there's something else different about them.

Technically, you want the injured player to have a higher performance level (if above average), on a rate basis, to make these guys equals. Why? Because his observed rates occur on a smaller PA sample size, and therefore, are less reliable to his true talent level.

Posted 4:12 p.m., October 14, 2003 (#9) - dlf
  Tom,

I'm not explaining myself well. What you are trying to do is find sets of matched players who are identical in everything EXCEPT how injury prone they are. However, by introducing performance into the twinning definition, I suspect that you have a dependent variable in the matching sets. I don't think it is just a question of sample size that needs to be overcome - rather when not on the DL, the injury prone player is less likely to be at 100% than is the non-injury prone one. I tend to think that the talent difference between injury prone Bob Horner and non injury prone Mike Schmidt was much smaller than the observed difference in their respective rate stats.

Posted 4:23 p.m., October 14, 2003 (#10) - tangotiger
  Hmmm... but how about the flip-side? How about a guy who is hurt, but doesn't land himself on the DL?

Isn't it possible that the non-DL guy will play hurt more than the DL-guy? I'm not sure... just throwing it out there.

Posted 5:01 p.m., October 14, 2003 (#11) - Vinay Kumar (homepage)
  Here's the link for the original thread. It's not worth bothering with the archives on this site when Google is available (actually, that's true for just about every site these days; Google does a better job archiving web sites than most sites do themselves).

Posted 5:06 p.m., October 14, 2003 (#12) - Vinay Kumar
  Oops, I somehow missed post #2. Sorry.

Posted 6:23 p.m., October 14, 2003 (#13) - Alan Jordan
  "Do a matched-pair study. That is, you have 2 groups that are equals in terms of:
- age
- position
- body type
- performance level"

Forget about matched pairs. You have to break continuous variables into discrete levels (i.e. age becomes 18-25, 26-30, etc...), you lose cases because they can't be matched and then you have arguments over what's a pair in the first place.

Go back to

Days on DL = x + y, where
x = 31 if injury prone
y = 1.3 * (Age - 23)

and add dummy variables for positions (X is a dummy variable).
You can add variables for body types (if you have that data) and performance. Also if you think that catchers wear out faster than other position players you can add a slope dummy for catchers where cage=0 for all positions except catcher where cage=age. This allows age to have a different effect for catchers on the number of days on the dl (cage can also be called an interaction between position=catcher and age). You don't have to throw out any cases unless you think there is a group that is theoretically problematic.

You can also try nonlinear transformations of age such as the square, sqrt, log and inv to see if the effect of age increases/decreases per year as the players get older.
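To make the dummy coding concrete, here is one way a single player's row of regressors might be built before handing it to whatever regression routine you use. The position list, the choice of 1B as the baseline, and all the column names are assumptions made for this sketch; only `cage` and the `Age - 23` term come from the thread.

```python
# Hypothetical position levels; the first one serves as the baseline category.
POSITIONS = ["1B", "2B", "3B", "SS", "OF", "C"]


def design_row(age, position, injury_prone):
    """Build one row of regressors for a Days-on-DL regression."""
    row = {
        "intercept": 1.0,
        "injury_prone": 1.0 if injury_prone else 0.0,
        "age_minus_23": float(age - 23),
        # One of the suggested nonlinear transformations of age (the square).
        "age_minus_23_sq": float((age - 23) ** 2),
    }
    # One 0/1 dummy per position, skipping the baseline to avoid collinearity.
    for pos in POSITIONS[1:]:
        row[f"pos_{pos}"] = 1.0 if position == pos else 0.0
    # Slope dummy: cage = age for catchers, 0 for everyone else.
    row["cage"] = float(age) if position == "C" else 0.0
    return row


print(design_row(30, "C", True))
print(design_row(30, "SS", False))
```

Leaving one position out as the baseline is the usual way to keep the dummies from being perfectly collinear with the intercept; each position coefficient is then read as a difference from that baseline.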

Show the t values or p values for your equation so people can tell if it's just chance. I can't imagine that a coefficient of 31 is just chance, but what about 1.3 for age?

Posted 7:26 p.m., October 14, 2003 (#14) - Tangotiger
  The "x"/"std error" was about 3.5 for the "31" and less than 1 for the "1.3".

If those refer to the standard deviations, then I suppose the second parameter is probably chance.

Can you fill in the blanks, Alan?

Posted 8:21 p.m., October 14, 2003 (#15) - Michael
  I think this is all good stuff, but I agree that looking at effectiveness is going to be important. I mean Shawn Green may not have missed many games this year but he wasn't very effective.

Posted 10:10 p.m., October 14, 2003 (#16) - FJM
  You are mixing two different phenomena here: frequency and severity of injury. Frequency should be more predictable than severity. You don't want to treat a player with 3 different visits to the DL totalling 90 days the same as one with a single, 90-day layoff.

Here's what I suggest. Set aside severity for the moment. For each player who has been in MLB at least 4 full years run a regression where Y=number of times on the DL in Year N and X1 is the number of times in Year N-1, X2 is the number in N-2 and X3 is the number in N-3.
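The data layout for that regression could be prepared along these lines. This sketch only builds the (Y, X1, X2, X3) rows from a player's yearly DL trip counts; the function name is made up, and the sample counts are invented for illustration.

```python
def lagged_rows(trips_by_year):
    """Turn {year: DL trips} for one player into (y, x1, x2, x3) rows."""
    rows = []
    for year in sorted(trips_by_year):
        # Trips in years N-1, N-2, N-3; None where the player has no record.
        lags = [trips_by_year.get(year - k) for k in (1, 2, 3)]
        if None not in lags:
            rows.append((trips_by_year[year], *lags))
    return rows


# Made-up trip counts for a hypothetical player, for illustration only.
example = {2000: 0, 2001: 2, 2002: 1, 2003: 1}
print(lagged_rows(example))
```

A player with exactly four full years contributes one row; each additional year adds another, so long careers carry more weight unless you account for that.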

Posted 11:07 p.m., October 14, 2003 (#17) - Jim
  A related issue I'd love to see studied is whether there are any injury patterns on the team level. Do some teams have consistently more injuries than others, at least more than would be expected by chance?

This might tell us whether some teams are able to avoid injuries either by avoiding injury-prone players or by encouraging superior training/conditioning techniques. I think the average fan places little to no blame on a team that is hit hard by injuries, but rather attributes it almost entirely to bad luck. Yet this study implies individual injuries can be predicted to some reasonable degree of accuracy, and so therefore a team should be able to at the very least make personnel decisions that are likely to reduce team injuries.

Posted 12:02 a.m., October 15, 2003 (#18) - Alan Jordan
  Tango- The "x"/"std error" was about 3.5 for the "31" and less than 1 for the "1.3".

They are standard deviations, though they are usually referred to as t values. The t distribution has a different shape depending on the number of cases you have. As your number of cases becomes larger, the t distribution approaches the z distribution.

I'm somewhat surprised that your stat program doesn't provide a significance level. If you have a table of t values and have a little practice using it, you can translate your t values into significance levels or p values. If you don't have a table there are some rules of thumb to help.

If absolute value of t is greater than 2 then p<.05
if absolute value of t is greater than 3 then p<.01
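Without a t table handy, the rules of thumb can be checked with the large-sample (z) approximation. This is a sketch under that assumption; with only a handful of cases the true t-based p values would be somewhat larger than what it prints.

```python
import math


def approx_two_sided_p(t):
    """Two-sided p value using the normal (z) approximation to the t distribution."""
    return math.erfc(abs(t) / math.sqrt(2))


# t ratios of roughly the sizes discussed in the thread.
for t in (1.0, 2.0, 3.0, 3.5):
    print(f"|t| = {t}: p is about {approx_two_sided_p(t):.4f}")
```

With about 100 players, as in Steve's data, the normal approximation should be close enough for this purpose.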

According to what you're posting, being injury prone is significant even controlling for age at p<.01 (actually p<.001 here), while age isn't significant. Of course, if there is a curvilinear relationship between age and being on the dl, then linear regression is going to underestimate the relationship here, because age is specified as a linear effect. But that's kind of piddling here, because the mean difference was only one year with two groups of 50 cases.

Basically the sample here implies that injuries are more a function of individual players than age.

Posted 12:22 a.m., October 15, 2003 (#19) - Alan Jordan
  FJM - You are mixing two different phenomena here: frequency and severity of injury. Frequency should be more predictable than severity. You don't want to treat a player with 3 different visits to the DL totalling 90 days the same as one with a single, 90-day layoff.

It wouldn't hurt to run the analysis both ways. In fact that's very often done. The researchers might present both results or summarize one in the footnotes.

It's an empirical question but my guess is that severity and frequency are correlated. People who miss work often also tend to be out for longer periods. That's a different process from baseball because motivation tends to push people away from work but baseball players towards playing. Anyway there is a theoretical justification for trying to predict days on the dl.

There is also another reason for doing days on the dl instead of number of trips. Number of trips to the dl is discrete (0,1,2,3...). Since most players will have 0, some will have 1 and a smaller group will have two etc... it will be difficult to get a high r-square and more difficult to get significant p values because of the low amount of variance (everyone bunched towards 0) and having a discrete dependent variable (discrete dependent variables tend to have lower r-squares). Also the justifying assumptions of linear regression tend to break down when you have discrete dependent variables, which can cause your significance levels to be wrong. Being anal about this I would model number of trips to the dl with either a Poisson regression, negative binomial regression or an ordinal logistic regression and see which fit better.
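Before fitting any of those models, a rough first look at whether plain Poisson is plausible is to compare the variance of the trip counts to their mean. This is a simpler stand-in for the full model comparison described above, not a replacement for it, and the counts below are invented for illustration.

```python
from statistics import mean, variance


def dispersion_ratio(counts):
    """Sample variance divided by sample mean; near 1 is Poisson-like."""
    return variance(counts) / mean(counts)


# Invented DL-trip counts for 100 players: most at 0, a few at 1, fewer at 2+.
trips = [0] * 62 + [1] * 25 + [2] * 9 + [3] * 3 + [4] * 1
ratio = dispersion_ratio(trips)
print(round(ratio, 2))
```

A ratio well above 1 (overdispersion) is the usual hint that a negative binomial model may fit better than Poisson, since Poisson forces the variance to equal the mean.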

Posted 12:47 p.m., October 17, 2003 (#21) - Andrew Edwards
  Just a thought on sample design. There are two questions bundled together:

1) Is past DL time a predictor of future DL time?

Steve's study does a good job of addressing that, and some of the discussion since could refine it further. Overall, this is an important insight - there are plenty of teams who seem to behave as though DL time were random, and they've obviously got something to learn.

2) Are past injuries a predictor of future injuries?

I think it's important to distinguish this from the first question.

Two times on the DL could both be caused by the same injury. For instance, from what I understand of Ken Griffey, what's really going on with his legs is that his hamstrings are pretty much permanently shredded. They get worse and better through time, according to all kinds of variables, and this leads to time on and off the DL. But it's really just a single injury that leads to multiple times on the DL.

So if Griffey's legs act up in 2004, and he goes on the DL, and then again in 2005, and he goes on the DL, it's not that he was injured in both seasons. It's that a single injury never healed (and never will), and the pain just got to be too much for a while in both years.

He also, though, had a second injury this year when he dislocated his shoulder. This suggests that above and beyond having a chronic injury that makes him DL-prone, he also may have some strange attribute that makes him injury-prone. I'd like to see that investigated too.

Ideally, we'd also control for team, although we're a few years away from that.

Posted 1:03 p.m., October 17, 2003 (#22) - tangotiger
  That's a good point. If you have an injury in which it's expected to be recurring, then we're not really addressing the issue of random injuries.

So, the selection of injury-prone players would be those players who have an injury that is not expected to recur. In hockey and football, Troy Aikman, Eric Lindros, Pat LaFontaine all have had multiple concussions, and I think we can recognize that this injury might be more likely to happen to guys who've already had it before, and we're not really proving anything.

What we want is really for a guy to have been on the DL for one ailment, and then been prone to be on the DL for another ailment.

You can also have a concurrent study of guys who pull their hammys and get on the DL, and then get back on the DL for the same ailment. That is, how recurring is an injury? In this case, a regression analysis would probably suffice.

Good point Andrew!

Posted 2:30 a.m., October 19, 2003 (#23) - RossCW
  This might tell us whether some teams are able to avoid injuries either by avoiding injury-prone players or by encouraging superior training/conditioning techniques ... Yet this study implies individual injuries can be predicted to some reasonable degree of accuracy, and so therefore a team should be able to at the very least make personnel decisions that are likely to reduce team injuries

Or it may be that some teams use the DL more often while others tend to keep players on the active roster for minor injuries.

I suspect that most teams tend to keep star players on the roster while they recover from minor injuries where they would DL another player. They may be used to DH or pinch hit even when they aren't able to play in the field.

Posted 4:05 p.m., October 21, 2003 (#24) - J Cross
  I'd be interested to see the DL days for Will Carroll's green, yellow and red light players. How do the green players compare to the red players (taking out pitchers and catchers)? Did he predict injury (or at least DL days) as well as past injury did?

Posted 4:16 p.m., October 21, 2003 (#25) - tangotiger
  J, that's a great idea!

I have long been annoyed at the prognosticators that did not have the decency to revisit what they've said, before going on to their next Nostradamus projections.

Tom Tippett is one of the very few that has actually laid it all out. Voros did this as well for a few years.

And, this is not just baseball, but in stock picking, weather, football lines, and any forecasting model. As far as I'm concerned, if someone makes a set of forecasts, he should be obligated to go back and look at how well he did, and let the readers know.

Posted 5:58 p.m., October 21, 2003 (#26) - Steve Treder
  "What we want is really for a guy to have been on the DL for one ailment, and then been prone to be on the DL for another ailment."

Good point, although from the point of view of teams, it doesn't really make much difference whether the guy keeps going on the DL for a recurrence of the same injury or a new unrelated injury -- he's still out of action either way. And I guess my intuition is that a player who is likely to have suffered the recurrent type of injury in the first place is likely to have a body type (and/or conditioning regimen) that makes him susceptible to other injuries too. But the distinction would be interesting to try and test.

I'm pleased (and a bit surprised, frankly) that my little internet-equivalent-of-a-bar-bet has stimulated so much interest. Will Carroll has contacted me, and says he is probably going to write it up in an article on Baseball Prospectus. Science marches on!

Posted 11:58 a.m., October 22, 2003 (#27) - J Cross
  Nice work, Steve. I think it's a surprising yet convincing result. I would have bet a beer against it in March. Also, 31 games! (or something of that magnitude) That would be an important result in terms of evaluating signings and trades.