Tango on Baseball Archives

Patriot: Baselines (September 17, 2003)

Take 20 minutes to read this. It's very much worth your while. An excellent piece on replacement level, average, Win Shares and more!

I especially liked:
The replacement paradox is essentially that, using a .500 baseline, a .510 player with 10 PA will rate higher than a .499 player with 500 PA. And that is true. But the same is just as true at lower baselines. Advocates of the minimal baseline will often use the replacement paradox to attack a higher baseline. But the sword can be turned against them. They will say that nobody really cares about the relative ratings of .345 and .355 players. But hasn't a .345 player with 500 PA shown themselves to have more ability than a .355 player with 10 PA? Yes, they have.
--posted by TangoTiger at 11:39 PM EDT
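
An editorial sketch of the paradox quoted above, in Python. It assumes value is simply (rate - baseline) x PA, a linear form chosen for illustration and not taken from Patriot's article; the function name `value_above_baseline` is hypothetical.

```python
def value_above_baseline(rate, pa, baseline):
    """Playing-time-weighted margin over a baseline.

    Assumption: value scales linearly as (rate - baseline) * PA. The units
    are arbitrary; only the sign and the ordering matter here.
    """
    return (rate - baseline) * pa


# (baseline, part-timer, regular) taken from the quoted passage.
cases = [
    (0.500, (0.510, 10), (0.499, 500)),
    (0.350, (0.355, 10), (0.345, 500)),
]

for baseline, part_timer, regular in cases:
    v_part = value_above_baseline(*part_timer, baseline)
    v_reg = value_above_baseline(*regular, baseline)
    print(f"baseline {baseline:.3f}: "
          f"{part_timer[0]:.3f}/{part_timer[1]} PA -> {v_part:+.2f}, "
          f"{regular[0]:.3f}/{regular[1]} PA -> {v_reg:+.2f}")
```

In both rows the 10 PA player comes out ahead of the 500 PA player, which is the point of the passage: the paradox is not specific to the .500 baseline.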

Posted 1:06 a.m., September 18, 2003 (#1) - Michael Humphreys
Excellent article--maybe the best on replacement value that I've seen, and certainly the most balanced and comprehensive.

I ultimately agree with Patriot that it's best to provide two or three baselines. For position players, perhaps .500, .425 and .350 would work. I'd actually prefer a "rate" metric based on runs (including defensive runs) per 162 games, because it's easier to translate into wins/losses. For people putting together reference books, aside from the massive work in figuring out the levels, it would only involve adding an extra column or two to ratings: e.g., TPR at .500 level; TPR at .425 level, etc.

For making career ratings of all-time greats, I don't think it's generally a good idea to compare them to the .350 level, because no team would allow a player to stay for any length of time at that level. The Tango/Nate Silver models are neat.

For major league organizations trying to build an actual team of 25 players, half of whom won't play that much, multiple levels make sense. Teams don't have eight positions and a pitcher: they have eight full-time positions, utility players needed in reserve, five starting pitchers, a closer or two, some set-up pitchers, etc. To be simplistic about it, there are roles for role players and different pools of replacement talent for each role. If your team needs to fill a role, you look at the population of players who can fill it and pick the best one. You never go out and get a .350 utility infielder to replace your starting shortstop. You may have to accept using one for a while, but you won't for long.

Chaining is so hard to model, especially for pitchers. Again, would one include mop-up guys in the "replacement pool" if you lost your #2 starter? No, you'd probably work your #1, 3, 4 and 5 guys more. It's a real mess, and I don't know how to model it. Unfortunately, it really has a great deal of relevance in Cy Young awards and in putting together good career ratings.

Posted 9:27 a.m., September 18, 2003 (#2) - tangotiger
My personal preference is to present TWO figures,
the "Wins" and "Loss" or
"Runs scored" and "Runs Allowed" or
"Player x" and "Average"

and then let the reader decide how to manipulate the numbers.

If you want win differential, fine. If you want 2*wins - losses, fine too.

As Patriot noted, we each have our own objectives and questions to answer. Providing two numbers allows all those objectives to be answered individually, instead of having the Win Shares or TPR model imposed on us.

Posted 10:23 a.m., September 18, 2003 (#3) - Patriot
I'm glad somebody was amused by that article.

I know (or at least I think I know) what David will say if he reads us advocating multiple or pick-your-own baselines, which is that for the "general question" of value, you need a single baseline, and that the minimum level of ability needed to play in the majors is the best one.

I agree with Michael that chaining is very hard to model, and therefore may be a good idea but not a very practical one.

I don't know if anyone thought anything of the "multi-tiered" idea I had at the end of that article, and I realize that it is very unwieldy. My point behind it, though, is: does a .390 player have value in helping a major league team win more games than it would with a FAT player (or whatever you want to call him) in his place? Absolutely. But if I make that .390 player bat 500 times, does he have value? Well, he still does over the FAT player, but using a .390 player as a regular is really going to kill me. He might have value for 100 or 200 or even 300 PA, but after that, you should have somebody better. Of course, this will rightfully be rejected by someone who assumes that you are measuring value to a team with all FAT players. I guess the "multi-tiered" model implicitly assumes that you are dealing with an average type ML team.

Posted 10:50 a.m., September 18, 2003 (#4) - tangotiger
I agree with Michael's assertion about the comprehensiveness and balance to the article.

The tiered approach is also excellent because it attempts to model reality, and I'm a big proponent of that line of thinking.

I also agree with the quote of Patriot that the "playing time issue" and the "negative/positive" paradox are not limited to a .500 baseline but apply to any baseline. This was well-described by Patriot.

In terms of paying someone money, you'd pay someone the minimum to perform the minimum. Kinda like a college graduate coming in on an internship at my company. Any marginal performance above this marginal player gets money at a multiple of this marginal performance.

(You can of course get a non-linear relationship, but I don't believe in that, unless you factor in playoffs.)

Therefore, what a team pays is based on overall team performance. If you lose Vlad, you have a chaining process so that the team won't be as bad as replacing all his PAs with a schlub.

A team pays based on the marginal change to the overall team, but credits that change to the variable that changed.

It's a fascinating topic, and as Patriot points out, there's no one right answer. From this standpoint, Pete Palmer and Bill James should listen, and read Patriot's article.

Posted 6:57 p.m., September 18, 2003 (#5) - David Smyth
There is NO baseline which is inherently "correct". Therefore it is silly to speak in those terms. This is what leads to these "politically correct" ideas to hold them all as "equal, just different", or to present a column with 3 of them, for the reader to pick his fave. Tango's suggestion just to present the raw materials and let the reader apply his own baseline transformation is, technically, the wisest course, but not very user-friendly.

So I look at all this, with understanding of all of the chainist, progressivist, etc., constructions--and I have no doubt that, if I had to pick just one as the best blend of quality and quantity rating, I would go with the "minimalist", as Patriot christens it. If you don't want to be forced to pick one, you can follow Tango's approach. I would not be very excited to see a rating list according to chaining.

Patriot writes that most analysts prefer the minimalist level as the "de facto" level. There is a good reason for that.

Posted 6:58 p.m., September 18, 2003 (#6) - David Smyth
Oh, I forgot to mention--great work, Patriot!!!!!!!!!!!!!!!!!!!!!!

Posted 7:47 p.m., September 18, 2003 (#7) - Gary L
I love rutabaga!

Posted 5:10 a.m., September 19, 2003 (#8) - Michael
They will say that nobody really cares about the relative ratings of .345 and .355 players. But hasn't a .345 player with 500 PA shown themselves to have more ability than a .355 player with 10 PA? Yes, they have.

Yes, but when we talk about what players have shown in the past, an observed .345 player with 500 PA has less VALUE in the past than the .355 player who has 10 PA (assuming a .350 FAT rep. level). And if you knew with certainty a priori that A is a .345 player and B is a .355 player, that you had infinite FAT at .350, and that you could get up to 500 PA from A but only a max of 10 PA from B, then you'd want to have player B in your organization, but you wouldn't care to have player A, because you'd rather just replace A with a FAT.

Posted 8:28 a.m., September 19, 2003 (#9) - Tangotiger
What if the true FAT line is .340? .330? .300?

Posted 10:04 a.m., September 19, 2003 (#10) - Patriot
Yeah, what Tango said. We can't really know for sure what the FAT line is. And I don't think there's really a line. There's no magic divider point at which there are suddenly tons of available players. It's a curve. There are more .700 players than .800 players, and in turn there are more .600 players than .700 players, etc. This gets steeper as you go, of course, because there are 4 billion .000 players available. But there's still no one magic number IMO.

Posted 10:26 a.m., September 19, 2003 (#11) - tangotiger (homepage)
I just want to point people who may not have seen it to the above link. It's my theoretical work (with some empirical data to support it) on the talent distribution in MLB and around the world.

I think it's easy to see that, while the ideal line might be at 80% of MLB average, the non-uniform distribution of talent at the team and position level would make it a very non-stationary line.

Again, depending what you want, using an average baseline is perfectly fine. Going forward, as Patriot noted, all you want is a rate stat. You just need to know that this player is a "101" and that player is a "98" and it's irrelevant that the average is "100" or that the minor leaguer is a "75" or whatever. 101 is better than 98.

Going backwards, the 101 might have contributed 1.1 wins and 1.0 losses, while the 98 might have contributed 3.9 wins and 4.0 losses. Playing time is a consideration. While the 101 did contribute more than the opponent that he actually played against, the 101 contributed LESS than the opponent that he did NOT play against (because he was on the bench at the time). There's an opportunity cost in sitting down and not playing, and that's in letting someone else, presumably worse than you, play.
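
To make the two-number presentation concrete, here is a small Python sketch using the wins and losses from this comment. The formula wins - baseline x (wins + losses) is an assumed, commonly used way to express wins above a baseline, and the function name `wins_above` is hypothetical.

```python
def wins_above(wins, losses, baseline):
    """Wins above what a `baseline` winning-percentage player would earn
    in the same number of decisions (assumed formula)."""
    return wins - baseline * (wins + losses)


# Wins and losses as given in the comment.
players = {"the 101": (1.1, 1.0), "the 98": (3.9, 4.0)}

for name, (w, l) in players.items():
    print(name,
          f"W-L: {w - l:+.2f}",
          f"2W-L: {2 * w - l:+.2f}",
          f"above .500: {wins_above(w, l, 0.500):+.3f}",
          f"above .350: {wins_above(w, l, 0.350):+.3f}")
```

Against .500 the 101 leads (+0.05 to -0.05); against .350 the 98 leads, because his extra playing time now counts in his favor. The reader picks the transformation; the two raw numbers stay the same.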

Posted 11:31 a.m., September 19, 2003 (#12) - JK
I don't agree with the replacement paradox; or, more to the point, I accept its literal truth but reject its significance. The relevant question seems to me to be: given an established level of (empirical) performance, which player would a team prefer to take? All else equal, a team may prefer to take the .499 player with 500 PA over the .510 player in 10 PA because the former has proved to have significant value whereas the latter has proved very little. The same reasoning can be used to show that the true talent level of a .345 player in 500 PA is higher than that of a .355 player in 10 PA. But that does NOT mean a team would prefer to take the former player, because all he has done is prove he is worthless to the team, whereas with the latter there is some chance (even if small) that his true talent level is much higher.

I.e., if one takes the point of using a replacement baseline as being to assess the line at which a player ceases to have value to a team relative to available substitutes, as opposed to an attempt to get at absolute true talent levels, there is no paradox.

Posted 11:48 a.m., September 19, 2003 (#13) - tangotiger
This assumes that the replacement line is fixed, whereas the likelihood is that the replacement line is centered around some point, say 80% of league average, with a distribution around it, of say 1 SD = 3%.

So, the question is: what is the probable true talent distribution of the .345 player in 500 PA? You might say that it is centered at .355, with 1 SD = .020. We can never know for certain what his true talent level is, so the best you can do is come up with a distribution of what his true talent probably is.

Then, you ask the same question about the .355 player in 10 PA. Maybe it's centered at .370, with 1 SD = .050. You are less certain of his true talent level, and therefore your distribution is much wider than the first guy's.

Now, overlaid on these 2 distributions is the "replacement-level" true distribution. And again here, we don't have 1 fixed point. The true level might be .350, with 1 SD = .01.

Finally, the question you can ask is: what is the probability that the first player is above a replacement-level player? In essence, what's the chance that a .345-in-500-PA player (or a true .355 player, +/- .020 = 1 SD) would "win" against a .350 +/- .01 player?

And you ask the same question of the .370 +/- .050.
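
One way to put numbers on these two questions, sketched in Python. It assumes each distribution is normal and that the player and the replacement level are independent, neither of which is stated in the comment; `prob_above` is a hypothetical helper.

```python
import math


def prob_above(mean_a, sd_a, mean_b, sd_b):
    """P(A > B) for independent normal variables A and B.

    A - B is normal with mean (mean_a - mean_b) and variance
    (sd_a**2 + sd_b**2), so P(A > B) is the standard normal CDF at z.
    """
    z = (mean_a - mean_b) / math.sqrt(sd_a ** 2 + sd_b ** 2)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Replacement level as described above: centered at .350, 1 SD = .010.
REPL_MEAN, REPL_SD = 0.350, 0.010

# The .345-in-500-PA player: true talent centered at .355, 1 SD = .020.
p_regular = prob_above(0.355, 0.020, REPL_MEAN, REPL_SD)

# The .355-in-10-PA player: true talent centered at .370, 1 SD = .050.
p_part_timer = prob_above(0.370, 0.050, REPL_MEAN, REPL_SD)

print(f"500 PA player above replacement: {p_regular:.3f}")
print(f"10 PA player above replacement:  {p_part_timer:.3f}")
```

Under these assumptions the first player is above replacement roughly 59% of the time and the second roughly 65%. In this example the higher center of the 10 PA player's distribution outweighs his wider spread.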

Posted 7:54 p.m., September 19, 2003 (#14) - David Smyth
Trying to figure what is the probability that a player is a repl player was the method used by B James in (I think) the '82 and '84 Abstracts.

All of these repl schemes can be looked at in terms of the context they imply.

Patriot and Tango prefer the context of an "avg team", with typical players in all of the roles. This allows you to do such manipulations as chaining and the "tax bracketing", etc. There is certainly a good deal of logic to this approach.

The "minimalist" approach implies a context of a team of all FAT players. Since this is less "realistic", how can it be justified? A context of an avg team is not a context of "no value". An avg team, with a 60 million$ payroll, contains a good deal of value within the system. A reference team with no value (within the baseball universe) would be a FAT team. There is no such thing in real baseball, but there are obviously loads of below-avg teams, and there are the Det Tigers, with a record not very far from that of a repl team. It's not hard to imagine a financially-strapped team sending out 25 scrubs, at a total cost of about 7.5 million.

So this viewpoint tells a player's "maximal" value. If, in your rating system, you want to be as inclusive as possible, and use a sort of zero context which does not mask real differences between players, then a minimalist approach using a .350 or so baseline for all players is appropriate.

Posted 7:56 p.m., September 19, 2003 (#15) - Smyth
I meant '82 and '83 (not '82 and '84) for B James.

Posted 8:12 p.m., September 19, 2003 (#16) - Patriot
You are correct about the various approaches that I prefer dealing with an average team construct. But also remember that whatever RC number I'm using in there is also based on the player being on an average team. It seems to me that if you are going to use the approach of a minimum team, you should do it the whole way through, including your RC number. Of course, I have seen David many times talk about using a replacement team rather than an average team in a TT construct and such things, so it seems like something he has considered.

Posted 8:05 a.m., November 12, 2003 (#17) - tangotiger
Bringing this forward for those who missed it.

Posted 10:01 a.m., November 12, 2003 (#18) - Joshua
Patriot, thanks for this comprehensive look at baselines.

I hope you can expand on your response to Bill James.

For instance, in Win Shares, he writes: "Total Baseball tells us that Billy Herman was three times the player that Buddy Myer was." No, that's not what it's telling you. It's telling you that Herman had three times more value above his actual .500 opponent than did Myer. He writes "In a plus/minus system, below average players have no value." No, it tells you that below average players are less valuable than their opponent, and if you had a whole team of them you would lose more than you would win.

I didn't get "his actual .500 opponent" at first, but I think now that that term refers to the sum of players Herman competed against over his career, which we can assume to be very close to average in aggregate.

But that's really beside the point. Isn't TPR both saying

(i) that Herman had three times more value above his actual .500 opponent than did Myer; and

(ii) that value exists in outperforming your actual .500 opponent?

Put another way, I don't see how saying

Herman had three times more value above his actual .500 opponent than did Myer

is meaningfully different from saying

Herman was three times the player that Buddy Myer was.

James purports to have a reductio for TPR. Your response is to deny that there's anything absurd at the bottom, but I'm left not understanding what it is TPR is saying, if you can't conclude from its ratings that Herman = 3x Myer, because that's certainly what it looks like on the face of it: Herman's TPR = 3x Myer's TPR.

I guess my bias is strongly against the average baseline, but I don't have any good idea what should be a baseline, and I'm suspicious of most of the replacement level numbers I've seen generated, for many of the reasons you give in your article. I'm firmly convinced, though, that there's value in being less than average, and any system that doesn't acknowledge this has a lot of work to do getting around this prima facie fact.

For now, I suppose the best way to present the data would be to show many baselines. I don't really see how there could be a single baseline. I mean, shouldn't it even be manager-dependent? For example, the player Dusty Baker uses to replace an injured 1B might be different from the player Earl Weaver would use....

Posted 10:24 a.m., November 12, 2003 (#19) - tangotiger
Herman had three times more value above his actual .500 opponent than did Myer

is meaningfully different from saying

Herman was three times the player that Buddy Myer was

The first one is a relative scale, and the second one is an absolute scale.

You cannot, just cannot, perform your division/multiplication on a relative scale and think it's going to give you anything meaningful.

Compare -1 Celsius to +1 Celsius. Compare -1 runs to +1 runs. Compare +.0001 runs to +1 run. Compare +1 runs to +10 runs.

Why in the world would you try to do +1/.0001 ? Or +1/-1?

Now, if you had one player being "101" and the other being "99" (where "100" is average), then you'd be on firmer ground.
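
A quick Python illustration of why ratios on a relative scale mislead. The win and loss totals below are made up for the example and are not Herman's or Myer's actual figures; the wins-above-baseline formula is an assumed linear one, and `wins_above` is a hypothetical name.

```python
def wins_above(wins, losses, baseline):
    """Wins above a `baseline` winning percentage over the same number
    of decisions (assumed formula)."""
    return wins - baseline * (wins + losses)


# Hypothetical players with equal playing time (21 decisions each).
player_a = (12.0, 9.0)
player_b = (11.0, 10.0)

for baseline in (0.500, 0.350, 0.000):
    a = wins_above(*player_a, baseline)
    b = wins_above(*player_b, baseline)
    print(f"baseline {baseline:.3f}: A = {a:.2f}, B = {b:.2f}, "
          f"A/B = {a / b:.2f}, A - B = {a - b:.2f}")
```

A is "three times" B only at the .500 baseline, and the ratio shrinks toward 1 as the baseline drops. The one-win gap between them is the same at every baseline, which is the sense in which the scale is relative.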

Posted 10:27 a.m., November 12, 2003 (#20) - studes (homepage)
Joshua, I tend to think scales like these are just like temperature scales. Is a 100 degree day twice as hot as a 50 degree day? I don't think anyone says that -- they say it's 50 degrees hotter. Why? Because zero is a rather arbitrary figure, at least if you're talking Fahrenheit. Of course, Celsius is a little less arbitrary, but it's not really that meaningful as a practical matter. And if you're talking five degrees below zero Celsius, how does that compare to five degrees above zero? Would we say "twice as hot" even if we used a Kelvin scale? I don't think so.

In my opinion, these value or ability scales are relative, not absolute, and should be discussed in that manner. I'm not a fan of multiple tiers. Way too confusing. I like Tango's concept: just present Wins and Losses and let the reader take it from there.

Posted 11:06 a.m., November 12, 2003 (#21) - David Smyth
The comments about the relative scale are correct; I mean, all of us here understand what TPR really means.

But James is also correct in his interpretation of what Palmer *intends* TPR to mean, because Palmer has stated that he uses .500 because a sub-.500 team cannot (usually) make the postseason. So he is implying that *real* value starts at .500. Given that, the interpretation by James of Herman/Myer is logical, and is indeed a *reductio*--relative to Palmer's stated interpretation.

Posted 11:19 a.m., November 12, 2003 (#22) - tangotiger
I agree on the issue of Palmer's intent. Palmer is wrong about his intent, and James is wrong for blasting wins above average, because of Palmer's intent. Just because Palmer misused it doesn't mean that the whole framework is wrong.

Posted 12:57 p.m., November 12, 2003 (#23) - Patriot
Where did Palmer say that bit about the sub-.500 team can't make the playoffs? I believe you, I'd just like to read it myself. Maybe in an early TB?

Anyway, in the last edition of TB, he gives a longer explanation that is a lot closer to what I would argue. I quoted it extensively (probably in violation of every copyright statute :) in my article.

I agree that if he just says "below .500=no playoffs" that that is weak. The argument he used later was a lot more substantive.

And I also agree that James may be right in assessing Palmer's intent, but he bashes "Linear Weights" over and over again. I realize that "Linear Weights" is the name of the system that Palmer invented, but it's unfair IMO to not point out the distinction for your readers (which James didn't)...especially when your own RC formula is 89% linear. Attack the baseline used in the system, not the system itself.

Posted 1:34 p.m., November 12, 2003 (#24) - tangotiger
Patriot, I reread your article again. Just an excellent piece!

Thanks for the clarification from the Palmer quote. He makes perfect sense there. I'll guess that David's Palmer quote was probably some "rush statement" he made, similar to stuff James would say in ESPN Chats.

Posted 7:26 a.m., November 13, 2003 (#25) - David Smyth
No, Patriot, the Palmer quote was (probably) in the last version of TB. I'll reread your article to see what he said in the current TB.

Posted 11:09 a.m., November 13, 2003 (#26) - Patriot
Well, the quote I got is from TB #7, which I think is the last. But I could be wrong.

If anyone is having trouble finding the Palmer excerpt in my article, it's under the heading "AVERAGE" and in italics.

Posted 3:42 p.m., November 13, 2003 (#27) - David Smyth
By "last" I meant the version right before the current version.

Posted 3:48 p.m., November 13, 2003 (#28) - Patriot
Gotcha.
