Tango on Baseball Archives

© Tangotiger


DRA Addendum (Excel) (January 16, 2004)

© Michael Humphreys

The link is to an Excel file providing position-by-position charts that show the DRA per-season ratings for each player who (i) is listed in Win Shares as having played at least 5,000 innings at his position and (ii) played at least five full-time (130+ games) seasons (not including seasons split between two teams) during the 1974-2001 time period. (For catchers, I provided ratings only for those rated by Tango's system, and I included them even if they had only three 130+ game seasons.)

As was done in the 1999-2001 study shown in the DRA article, the team rating at the position is credited to the full-time player. (Further below I'll explain why I did this.) Seasons in which the player played only 130-145 games (80-90%) are italicized. The Appendix has been formatted so that if you use Excel's option to shrink the page to 80% and print in landscape format, each position's chart will print on its own page for easy review. As the charts must be printed in landscape format, they can't be shown in the body of this article. However, I'll summarize the overall ratings here, position by position, and discuss any surprises.

Providing an “overall” assessment of each fielder is a complicated task. In The New Bill James Historical Baseball Abstract (the “Abstract”), each player’s overall rating is based upon his average performance under several different criteria, in order to try to reconcile the differences in career and peak performance and factors that escape quantification. Bill rates players based upon the following criteria: (i) the average of the player’s three best seasons, whether or not consecutive, (ii) the player’s five best consecutive seasons, (iii) the player’s per-162 game performance over the course of his career, (iv) the player’s grand total value (career Win Shares), rescaled so as to be of approximately the same scale as the other averages, and (v) a subjective element.

In general, the overall DRA assessments provided in the DRA article (Excellent, Very Good, Solid/OK, Poor) are based upon the "Index" (shown at the far left of the file), which is the average of (i) the average DRA rating for the player's five best consecutive seasons (which are "boxed" in the charts in the Appendix), excluding the severely strike-shortened 1981 and 1994 seasons, unless such seasons were necessary for the player to "reach" five seasons, (ii) the player's "career" DRA rating (i.e., the sum of all of his ratings for seasons in which he played 130 or more games at the position), divided by five, and (iii) the average of the player's second- and third-best ratings, whether or not consecutive or included in his five best consecutive seasons, but only including seasons of 146 or more games and only if the rating is above average (above zero). (At catcher I accepted the best and second-best 146+ ratings, if available, given the lack of data.) At times I will point out reasons for incorporating subjective considerations.
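For readers who want to recompute the Index from the charts, here is a minimal sketch in Python. The data layout (a list of (games, rating) pairs per player), the function names, and the edge-case handling are my own assumptions, not the spreadsheet's actual formulas; the catcher exception is not handled.

```python
def peak_five(ratings):
    """Average DRA rating over the best five consecutive seasons.
    Assumes strike-shortened 1981/1994 were already dropped upstream
    where they weren't needed to reach five seasons."""
    if len(ratings) < 5:
        return sum(ratings) / len(ratings)  # assumed fallback for short careers
    return max(sum(ratings[i:i + 5]) / 5 for i in range(len(ratings) - 4))

def dra_index(seasons):
    """seasons: chronological (games, rating) pairs, 130+ game seasons only."""
    ratings = [r for _, r in seasons]
    career = sum(ratings) / 5  # career total rescaled to the peak scale
    # Second- and third-best 146+ game, above-average seasons; dividing
    # by two even when a season is missing shifts the Index toward
    # average, matching the treatment described below.
    full = sorted((r for g, r in seasons if g >= 146 and r > 0), reverse=True)
    second_third = sum(full[1:3]) / 2
    return (peak_five(ratings) + career + second_third) / 3
```

For the typical Very Good pattern discussed below (a +10 five-year peak, a +50 career total, and a +10 second/third-best average), this works out to (10 + 10 + 10) / 3 = +10.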

The rationale behind this approach will become clearer as you go through the ratings, but in essence what we're trying to establish is a good indication of the player's significant and reliable contribution with his glove. When I rate a fielder as "Excellent", that means the evidence strongly suggests he had the clear ability to save close to 20 runs a season over an extended period of time, say, five or more seasons. Such players should have a few Gold Gloves to their credit, assuming voters were well-informed and thoughtful (they very often weren't). "Very Good" fielders had the clear ability to save about 10 runs a season for an extended period of time. They usually deserve to win a Gold Glove or two, depending on the competition in a given year. Solid/OK guys were basically average; actually, usually slightly above average, as players who play full-time are usually better-than-average fielders, as Bill James has observed and quantified. "Poor" fielders cost their teams more than 10 runs a season, compared to league-average fielders, over an extended period of time. (Yes, the "run-spreads" (runs per season) are "tighter" here than those I mentioned in the Introduction and Part II of the DRA article, because two- and three-year performances have wider swings than five-year-or-longer performances.)

The five-year “peak” rating is probably the best single estimate of the quality I’m trying to capture in the historical ratings. The “career” ratings are also important, however, even though the accuracy of the career rating is limited by the fact that part-time seasons are not included, and even though career ratings include “decline” phases, because data beyond the five-year peak provides additional evidence about the player’s ability.

The second- and third-best seasons of over 90% playing time are a good confirmation of the player's true peak ability, independent of whatever residual distortion is left in the system by using the team-level rating. By throwing out the "best" (146+) season, we prevent fluke seasons from seriously distorting our estimates. (There is only one truly "weird" single-season rating in the study, but we're trying to be conservative here.) In addition, a player may sometimes have a good season outside of his five-year peak, and it seems fair to factor this in. For those players for whom we lack seasons of greater-than-90% playing time, that very lack of data is something that should be considered. Only above-average "peak" ratings are included, as I found that one or two bad seasons for some players significantly distorted their overall Index. The Index for poor fielders is therefore somewhat shifted toward the mean, but for purposes of all-time rankings I'm more interested in evaluating good-to-great players correctly than in assessing precisely the negative impact of the truly poor fielders.

The three "objective" rating numbers are usually approximately the same. A typical pattern for a Very Good fielder would be a +10 five-year average rating, a "career" runs-saved rating (excluding seasons of part-time play) of +50, and a two-season average peak rating of +10. In other words, players who managed to put in more than five full-time seasons tended to be average (a "zero" rating) during their non-peak periods. But for those players who played well outside of their five-year peak, taking the career rating and dividing it by five gives them credit for sustained excellence. For players who failed to play two 146+ seasons, that "gap" in their record causes their Index to be shifted closer to an average Index rating, to reflect our slightly lower level of confidence in their individual performance, independent of the effects of their substitutes.

All of the per-season ratings are provided in the charts, so you can devise your own method of summing up the data. I just thought that this was a simple approach that worked reasonably well.

The Index numbers in the attached Excel chart are a little different from those provided in the DRA article, primarily because I decided not to include negative 146+ seasons when calculating the "Index" you see in the Excel spreadsheet (for reasons explained above) and because I caught a few very minor mistakes. In any event, the numbers in the attached file are the latest, and "trump" any differing numbers in the original DRA article.

The single-season rating for each individual full-time player is equal to the DRA rating for his team at the position he played. There are empirical, theoretical and practical reasons for doing this.

First, even if I had exact innings-fielded data and used it to "individuate" the ratings, DRA ratings wouldn't necessarily be better for full-time players. In some data samples, I used the 2000-2001 baseball-reference.com innings-fielded data to see if I could get a more accurate match with UZR for the DRA/UZR/Diamond Mind comparison in the DRA article. The results were inconclusive: sometimes slightly better, sometimes slightly worse. There are very good explanations for why this happened. The 130-game requirement is a cut-off point, but the average number of games played per full-time season in the test is probably closer to 146, or approximately 90%. Even if a player plays that many games, there remains a significant risk, just based on randomness, that the contextual factors in the small sample of games the player didn't play in won't be representative. So if you try to "individuate" the rating of the part-time fielder by pro-rating the team rating and adjusting it up or down based on the part-time players' plays made per inning fielded relative to the team average, the part-time players' rating (and, therefore, the full-time player's rating, which is just the team rating minus the part-time players' rating) will be off. Think about it another way: how much impact could an exceptionally good or bad back-up fielder have on a team's rating in 16 games? Or even 32? The maximum in remotely normal cases should be plus or minus a handful of runs per season.
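A quick back-of-envelope check in Python (the 20 runs per 146 games figure is simply my stand-in for an extreme backup, not a number from the study):

```python
# How much can an extreme backup move a team's positional rating?
# Assume a backup worth +/-20 runs over a 146-game full season.
runs_per_game = 20 / 146
print(round(runs_per_game * 16, 1))  # ~2.2 runs in 16 games
print(round(runs_per_game * 32, 1))  # ~4.4 runs in 32 games
```

Even an exceptionally good or bad backup moves the team total by only a few runs.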

Second, the UZR team/position/player ratings demonstrate that, at least for two- or three-year average ratings for players who played 130+ games per season, there is no meaningful difference between the team and the individual player UZR ratings. None. (In Part II.C of the DRA article I include the team UZR rating as well as the individual UZR ratings to show this.)

Third, by demonstrating in the 1999-2001 study provided in the DRA article that the impact of part-time fielders always randomizes away within two or three years, I hope to reassure you that the career ratings covering five or more seasons (provided here in the attached Excel spreadsheet) are basically reliable, even though they also assign the team rating to the individual who played at least 130 games for the team at the position.

If I were to do a comprehensive Win Shares-type project with the help of a computer programmer, I would use innings-fielded data and the Bill James method for estimating innings where we lack the data. The resulting estimated per-player DRA ratings would probably be reasonably good for part-time players (with one possible exception described in Part IV.C of the DRA article), and, to the extent they were off, the smaller number of games would not yield any substantial ratings distortions on a career basis. I just don't have the time or resources right now. For purposes of evaluating the truly significant fielders throughout history, the current method is, I believe, more than adequate, with the rare exception of a Gil McDougald or Jim Gilliam, who played outstanding defense at a number of positions part-time each season.

Thanks again for reading and for your interest.

--posted by TangoTiger at 09:14 AM EDT


Posted 11:45 a.m., January 17, 2004 (#1) - Mills
  What's in the box?

Posted 2:35 p.m., January 17, 2004 (#2) - MGL
  Again, terrific stuff and well-written! People don't realize yet that DRA is by far and away the best metric out there using traditional fielding data (not PBP data). It will some day be the gold standard! I am working on "converting" UZR into a linear weights formula that can be used with traditional fielding stats only. The results of that should be almost exactly the same as DRA (hopefully).

There are a million thoughts that come to mind as I peruse the Excel file. That the best hitter in baseball (by a wide margin, of course) and perhaps of all time is also at the top of the defensive list should boggle the mind! And the fact that Buckner is at the top of the defensive first baseman list and yet he is remembered for that one fateful gaffe is sad!

I don't recall you publicly releasing your "linear weights" formula, Mike, but did you publicly release the data that is used to compute each player's DRA, especially for catchers? For example, for a SS, you "need" assists, putouts (do you need putouts, for example), team pitcher data including G/F ratio, handedness, etc.??

Posted 7:32 p.m., January 17, 2004 (#3) - Michael Humphreys
  Mills,

The "boxed" numbers are the best five consecutive ratings for the player. The average of those numbers is included in the Index.

MGL,

Thanks. I do take pains in the article, however, to explain why the Buckner rating is incorrect, possibly to a large extent. I agree, though, that he was not a bad fielder.

The data used is BFP, IP, SO, BB, HBP, H, HR (2b and 3b at league level), PO per position, Assists per position, E per position (though regressions suggest that only Errors at pitcher and right field matter), WP/PB/BK, SB allowed, DPs, Runs Allowed. Out of that data the GB/FB and L/RHP factors are calculated. The format of the equations is provided in the first installment and in a thread to the third installment.
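Purely as an illustration of how those inputs might be organized per team-season (the field names here are mine, chosen to mirror the abbreviations above, and are not anything from the DRA article):

```python
from dataclasses import dataclass

@dataclass
class TeamSeasonInputs:
    """One team-season of the traditional data listed above."""
    bfp: int             # batters faced by pitchers
    ip: float            # innings pitched
    so: int              # strikeouts
    bb: int              # walks
    hbp: int             # hit batsmen
    h: int               # hits allowed
    hr: int              # home runs (2B/3B used at league level only)
    po: dict[str, int]   # putouts by position
    a: dict[str, int]    # assists by position
    e: dict[str, int]    # errors by position (only P and RF matter)
    wp_pb_bk: int        # wild pitches, passed balls, balks
    sb: int              # stolen bases allowed
    dp: int              # double plays
    runs_allowed: int
```

The GB/FB and L/RHP factors would then be derived from these fields.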

That's a great idea to create linear weights equations using UZR data. You'll probably get a great fit. Then you can apply the equations to pre-UZR years. It should work very well.

Posted 8:04 p.m., January 17, 2004 (#4) - David Smyth
To those of us who don't (yet) have Excel (yes, it's true, and I'm embarrassed), why not just put a "leaders" post or whatever here, so that everyone can share in the glory of DRA (absolutely no disrespect or satire intended).

Posted 9:48 p.m., January 17, 2004 (#5) - MGL
David,

You can order a "trial" version of Microsoft Office (I think it only lasts for a month, but I'm not sure if you can figure out how to reinstall it) for like $5.00!

Posted 10:48 p.m., January 17, 2004 (#6) - DW
  Michael,
I don't think I've commented on your efforts in any of the other threads, but I want to congratulate you on some very nice work. Good job.

Posted 11:20 p.m., January 17, 2004 (#7) - Tangotiger
  I'll post a PDF version of Michael's file on Monday.

Posted 8:44 a.m., January 18, 2004 (#8) - David Smyth
  Thanks, Tango. And MGL, it's not the money, it's laziness.

Posted 10:01 a.m., January 19, 2004 (#9) - tangotiger (homepage)
  PDF file can be found here.

Posted 10:21 a.m., January 19, 2004 (#10) - Michael Humphreys
  Tango, thanks for the PDF post. The file looks great and is much easier to read than the Excel spreadsheet.

Posted 11:03 a.m., January 19, 2004 (#11) - Sylvain(e-mail)
  As DW said:

Thanks a lot Michael for your DRA articles and the amount of work put into it and the sharing of the results.

Sylvain

Posted 1:56 p.m., January 20, 2004 (#12) - FJM
  The results of your study are certainly thought-provoking. I'm troubled though by the amount of year-to-year variation shown for each player. I computed the mean and standard deviation for each of the shortstops and then created a confidence interval about the mean 1 s.d. wide. (Normally I'd use a 2 s.d. conf. interval, but if I did that here I wouldn't be able to conclude much of anything.) The results: out of the 33 SS's shown, only 5 are significantly better than average (i.e., mean - st.dev. > 0.) (If I used 2 st. dev., only Mark Belanger would qualify, and we can't even be sure about him because half his career is missing!) Meanwhile, only 1 SS (DJ) is significantly worse than average (i.e., mean + st.dev. < 0.) Everybody else could be either better or worse.

In an attempt to improve the resolution I threw out all the italicized years (i.e., years in which they played fewer than 146 games.) That cut the database down by more than one third, from 256 years to 161. And it did reduce the standard deviations somewhat. (Using the full database, the average SS had a st. dev. of 11.0. In the smaller database, it was 10.2.) Frankly I'd been hoping for, and expecting, a bigger reduction than that. The results do improve a bit, with 9 players now coming out above average and 2 being worse. But that comes with a price. One player (Craig Reynolds) was thrown out entirely since he had only one qualifying year and so had no standard deviation. 5 others had only 2 qualifying years, which means their computed standard deviations are probably understated. If I throw those 5 out as well, we're back to 7 players better than average and only 1 worse. And 2 of those 7 make the cut by the tiniest of margins (0.03). Essentially, we're right back where we started.
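For anyone who wants to replicate the test, a minimal sketch (each player's record is assumed to be just a list of per-season DRA ratings; this is my reconstruction, not FJM's actual spreadsheet):

```python
from statistics import mean, stdev

def classify(ratings):
    """Flag a player as reliably above/below average using the
    one-standard-deviation test described above."""
    if len(ratings) < 2:
        return "insufficient data"  # no standard deviation from one season
    m, s = mean(ratings), stdev(ratings)
    if m - s > 0:
        return "significantly above average"
    if m + s < 0:
        return "significantly below average"
    return "indeterminate"
```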

Posted 3:07 p.m., January 20, 2004 (#13) - tangotiger
  Using my "True Talent" fielders generated from UZR, 166 of the 402 (41%) fielders were within 1 SD (using an average of about 1000 BIP per player). The spread of fielding talent, according to UZR, is about twice that which you'd get from random.

1 SD corresponds to about 5 runs per 600 BIP.

Posted 4:00 p.m., January 20, 2004 (#14) - Mike Green
Dale Murphy's poor showing, and Glenn Hubbard's excellent one, on the early-to-mid-80s Braves led to questions about the adequacy of the flyball/groundball staff adjustments in extreme cases. I'm wondering whether the '86 rating for Omar Moreno is available, to help see whether Murphy was really as bad as he looks.

Posted 4:46 p.m., January 20, 2004 (#15) - Charles Saeger(e-mail)
  Again, terrific stuff and well-written! People don't realize yet that DRA is by far and away the best metric out there using traditional fielding data (not PBP data). It will some day be the gold standard!

DRA is solid, but there's nothing I can see that shoves it above other metrics. It's another look. Most of its findings are right in line with Davenport's (CAD is usually pretty much in line with DFT also), which means to me that we may have reached a point for fielding data where we need to take another approach. All systems will fade as we take them back farther in time, DRA probably more so since it starts with PbP and reconciles it with traditional data. Those relationships probably don't hold up as well in 1910, and we would need to adjust everything accordingly.

Endorsing a system without being able to see its underlying nuts and bolts is a bad move, even if it uses one's own system as a basis.

Posted 10:16 p.m., January 20, 2004 (#16) - Michael Humphreys
  FJM,

Fair point. To use scientific terminology, it's rare that you can reject the null hypothesis that the fielder is average, even if you use the very weak p-stat test of .17 (one standard deviation).

But the same applies to UZR ratings, or nearly so, as far as I can tell, because the year-to-year "r" is about the same for UZR and DRA (approximately .5). It might be slightly higher for UZR, but not by much. Time will tell whether David Pinto's probability-based ratings have a higher year-to-year "r" than UZR.
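For concreteness, here is the kind of year-to-year "r" computation I have in mind (a sketch only; the data layout is an assumption, not code from the article):

```python
from statistics import correlation  # Python 3.10+

def year_to_year_r(players):
    """Pool each season's rating with the same player's next-season
    rating across all players, then correlate the two pooled lists."""
    xs, ys = [], []
    for ratings in players:  # each: one player's chronological ratings
        xs.extend(ratings[:-1])
        ys.extend(ratings[1:])
    return correlation(xs, ys)
```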

I seem to recall some e-mail thread, perhaps at fanhome, regarding the construction of confidence intervals for a single season rating if you have a series of such ratings and the year-to-year "r" information. It might be the same thing you've done, FJM, but I seem to recall there being some interesting wrinkles. Tango, do you recall the thread?

It might be an interesting exercise to calculate the proportion of players who, based on their year-to-year *offensive* performance, were reliably better than average at *hitting*, using the one-standard-deviation test. Of course there would be at least several, but probably a lot fewer than you would expect.

It would also be interesting to calculate the Linear Runs from BABIP for hitters, to find the hitters who were reliably better than average in BABIP. The number of "significant" above-average BABIP performers would probably be much smaller than for overall batting. Nevertheless, fans routinely include BABIP value in overall offensive value. (Though, as MGL has pointed out, you need to regress that component significantly if you're trying to find a true "ability" measure.)

Ultimately, there is always an important distinction between statistical significance and practical significance. Even if DRA ratings lack the former, they may still possess the latter, given that, as far as I know, no other ratings have (meaningfully) greater statistical significance and that teams *must* make staffing decisions using the best information available. For purposes of rating minor league fielders, lacking UZR, it would be good to use DRA. And, as explained in the article, it's also good to have DRA as a "back-of-the-envelope" check on extreme UZR ratings.

Mike (Green),

I point out in the DRA article that one of the Glenn Hubbard ratings is clearly wrong. Aside from that one number, they seem consistent and not extreme. See the Dale Murphy discussion, which points out that the Atlanta team rating at CF went up when Murphy missed a lot of time (thus "controlling" for whatever mismeasure there is for GB/FB). I am very confident that Dale was a below average centerfielder, for a host of reasons provided in the article.

Any GB/FB adjustment could be wrong. Any rating could be wrong. DRA deals with averages and statistically significant relationships (even if, as FJM points out, the resulting ratings might not have "statistical significance").

Charles,

DRA does *not* use PBP data. As explained in the article, DRA can be made to work for all periods of baseball history, though with less confidence for outfielders, for whom we lack separate LF/CF/RF putout data. For all infielders (including catchers) DRA works exactly the same throughout major league history using only publicly available data. I've also been developing some neat ways of estimating LF/CF/RF putouts for pre-Retrosheet seasons, and have obtained some nifty results.

DRA differs meaningfully from DFT, based upon how DFT is described in Mike Emeigh's "Jeter" articles. Clay's website says that the DFT methodology has been updated (including by ignoring errors--maybe Clay read the DRA articles).

First, assuming DFT's GB/FB estimate is the same, it is not as accurate as DRA's. I know because I checked by using Clay's method and comparing the results with UZR. One consequence of this is that DFT outfielder ratings are much more compressed than DRA ratings (which are more compressed than UZR outfielder ratings). DFT almost certainly understates the value of Willie Mays and Tris Speaker. Take a look at their DFT ratings and you'll see what I mean.

Second, DFT gives credit for infielder pop-ups/fly outs. Diamond Mind, UZR and DRA don't. One consequence of this approach is that, eyeballing the data, DFT infielder ratings seem to have more year-to-year variability, though I haven't measured this. I still have not read a compelling case, including a full disclosure of methodology, for obtaining estimates, using traditional data, of infielder fly outs that reflect *skill*--not FB pitching, ball-hogging, and larger foul territories.

Third, DFT *forces* runs-saved ratings at each of the positions for the team to add up (with pitching runs) to the actual number of runs allowed. Now on one level, this is a good thing: the system is fundamentally accountable, as is Win Shares. But I think it may result in a small sacrifice in the accuracy of individual fielder ratings.

Why? Well, the analogy would be to how Bill James calculates *batting* Win Shares. Bill James applies the formulas for the individuals on a team, and if the total differs from the team runs scored, he proportionally adjusts every batter's rating. But there is no basis for concluding that *each* player had a pro-rata impact on the relative efficiency or inefficiency of the lineup in converting the components of offense into runs. I would venture to guess that "unforced" offensive ratings have a higher year-to-year "r" and come closer to providing the true "value" measure for the player. It's probably not a big deal one way or the other, but I would imagine Clay's approach increases computational complexity.
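To make the "forcing" step concrete, a minimal sketch (my own illustration of proportional rescaling, not Bill's actual formulas):

```python
def force_to_team(estimates, team_total):
    """Proportionally rescale individual estimates so they sum to
    the actual team total, as batting Win Shares does."""
    scale = team_total / sum(estimates)
    return [round(e * scale, 1) for e in estimates]

# Estimates summing to 780 runs, forced to an actual 750:
print(force_to_team([120.0, 95.0, 565.0], 750))  # [115.4, 91.3, 543.3]
```

The rescaling guarantees team-level accountability, but every player absorbs the team's estimation error pro rata, which is exactly the objection raised above.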

DRA estimates of fielder and pitcher runs add up to an *estimate* of runs allowed, and such estimate is reasonably close, on average, to *actual* runs allowed, in the same way that regression analysis (or Linear Weights) estimates of team runs scored are close to actual runs scored. But there is no second adjustment to "force" a "perfect" fit.

Fourth, the DRA equations (the format of which is provided in the article) together constitute the simplest comprehensive system for rating pitchers and fielders ever devised. I appreciate that you have to trust me on this to one degree or another. But try reading Mike's (well written) summary of DFT. Or Win Shares. In neither case can you reduce the calculations to a one-line equation per position involving only simple addition, subtraction, multiplication and division.

I agree that no system deserves an unqualified endorsement until it's been fully disclosed, everybody has the means to replicate its results, and it's possible to *test* those results against *more than one* alternative "correct" system. Not to get too high falutin' about it, that's how real science is conducted. Unfortunately, the only fan-implementable fielding systems out there that are fully disclosed are Win Shares and CAD. (The data for UZR and Pinto's systems is proprietary.) Win Shares simply does not provide ratings that match well with UZR or Diamond Mind. CAD may, I just don't know.

My ultimate goal in writing my book is to make the *full and complete* case for DRA being a rigorously tested, reasonably accurate, very simple, and conceptually coherent system that any fan can apply using publicly available data. Should I ever get the damn thing published, it will be the only system for evaluating fielders satisfying those criteria.

Posted 11:51 a.m., January 21, 2004 (#17) - FJM
  It might be an interesting exercise to calculate the proportion of players who, based on their year-to-year *offensive* performance, were reliably better than average at *hitting*, using the one-standard-deviation test. Of course there would be at least several, but probably a lot fewer than you would expect.

It's a long way from the study you suggested, but here's an interesting observation. 8 of the OPS Top 15 qualifiers repeated in 2003. (It would have been 9 out of 15 if Vlad G. had had a few more PA's.) Out of the remaining 6, 4 finished in the next group of 15. Only Larry Walker and Mike Sweeney slipped significantly, and even they were still better than average.

Posted 9:16 p.m., January 22, 2004 (#18) - Charles Saeger(e-mail)
  DRA differs meaningfully from DFT, based upon how DFT is described in Mike Emeigh's "Jeter" articles. Clay's website says that the DFT methodology has been updated (including by ignoring errors--maybe Clay read the DRA articles).

I never said it did not. Read what I wrote again; I said that the *results* do not differ significantly. And I never, ever discussed Win Shares, which I have mentioned many times does not have replicable results. I actually like having a second context system, but especially with separating multiple fielders at one position, it has issues.

(And DFT was fully explained in a BP. I'm not extolling it, I should note, but its results are widely available.)

And I for one, with modern computers, see no reason why formulae need to fit in one line. The structure, yeah, that should be easy to explain, one line. Ockham's Razor leads us to prefer simplicity when we can. But when one needs an extra computation, there's no reason not to use it.

Win Shares simply does not provide ratings that match well with UZR or Diamond Mind. CAD may, I just don't know.

That isn't the only point of a system (the point being to make accurate and reproducible ratings), and that's what I meant by PbP data, reconciling to UZR ratings. Mike's articles show that even those are hardly perfect. (FWIW, there are no systematically published "Diamond Mind" ratings, but you knew that. There's a set of Pr/Fr/Av/Vg/Ex grades, but those aren't good enough for research purposes, since they're somewhat subjective. And those count errors for a fielder, or use catcher or outfield arms. You know this, but some folks see "Diamond Mind ratings" and assume there's a number floating out there and start looking for it.)

Posted 9:19 p.m., January 22, 2004 (#19) - Charles Saeger(e-mail)
  MH -- just to add, I think you're reading more into my comments than is there. Indeed, aside from the first baseman's putouts issue, I don't disagree much with anything you wrote in your articles.

Posted 1:14 p.m., March 12, 2004 (#20) - tangotiger
  This was posted by Miko:

=========================
Posted 11:50 a.m., March 12, 2004 (#21) - Miko
To those of us who don't (yet) have Excel (yes, it's true, and I'm embarrassed), why not just put a "leaders" post or whatever here, so that everyone can share in the glory of DRA (absolutely no disrespect or satire intended).

FYI:

One can almost always open Excel spreadsheets of the sort MHumphreys provides with the spreadsheet program "Calc," which is included in the open-source and freely downloadable OpenOffice.

OO may barf on Excel files which use lots of graphs and such, but so far I've been able to open all of the ones linked to on BPrimer.

Oh, and thanks, tango, for releasing data in .csv format so often. Data portability is underrated.