Tango on Baseball Archives

© Tangotiger

Valuing Starters and Relievers (December 27, 2003)

The Problem with “Average” Pitching Performance
By Guy Molyneux

In discussions of player valuation, there has been considerable debate about the proper way to evaluate the value of relief pitchers. Some systems credit relievers for the greater effect their “high leverage” innings can have on team wins, which is a reasonable approach if one is interested in measuring performance rather than ability. However, most of these valuation systems, including Win Shares (WS), share one feature that I believe serves to overvalue relievers to a considerable degree: using a single runs-prevented performance standard for evaluating both starters and relievers.

Most valuation systems compare pitchers to the average (mean) performance level in their league, or to some definition of replacement-level performance (often tied to the average). WS uses a runs-allowed benchmark that is presumably below replacement-level performance: a park-adjusted ERA 52% above league average. However, while the evaluation benchmark may change from system to system, within any given system it is generally the same for starters and relievers.

The problem with employing such a single standard is that the job of retiring major league hitters for 6 to 9 innings is demonstrably more difficult than the job of retiring the same hitters for 1 or 2 innings. Relievers post better ERAs than starters despite being, on average, significantly less talented pitchers. (In discussing “relievers,” I am referring to all innings pitched in relief, not only closers.) Moreover, those pitchers who have pitched in both roles tend to perform much better in relief.

In other words, it’s not meaningful to merge the performances of starters and relievers to define a single arithmetic “average” (or replacement) level of pitching performance. Instead, we should measure starters against the average starting pitcher (or more precisely, the performance of a pitcher of average ability when starting), and relievers against the average reliever.

Player valuation systems that fail to adjust for this fundamental difference will give relief pitchers more credit than they deserve for runs prevented during their brief stints on the mound. Of course, any given valuation system could have other components that offset this factor, and so may not actually overvalue relievers. But at least in evaluating runs prevented – the core of pitcher evaluation – any system that uses a single definition of average (or replacement) pitching performance will tend to overvalue relievers and undervalue starters.

How Much Easier is Relieving?

Last year, the average NL ERA in relief innings (4.05) was .36 runs better than the ERA posted by starters (4.41). But this aggregate comparison is only the beginning of the story, because relief pitchers on the whole have less ability than starters (if they could start, most would). The question we need to answer is: How much better can we expect an average pitcher to perform in relief than when pitching as a starter?

I don’t have a precise estimate of the starter/reliever gap, but the answer is clearly “a lot.” Eric Gagne is of course the poster child here, posting an ERA of 1.71 in relief but more than three runs higher as a starter. Even John Smoltz, a superb starter, never came close to the 2.17 ERA he has posted in relief the last two years. And below the Cy Young level, nearly every season witnesses several pitchers who have floundered as starters make a successful transition to relief (Chris Reitsma, Joe Nathan), while those who try the reverse frequently fail (Terry Adams, Danny Graves).

Tangotiger, in an as-yet unpublished article, estimates the effect of the starter/reliever role at about 0.60 runs in ERA. This strikes me as a plausible though perhaps conservative estimate. It is not uncommon for pitchers to lower their ERA by more than a full run when moving into relief. Over the 2001-2003 period, these pitchers were each more than a full run better as relievers than as starters: Alvarez, Brower, Adams, Burba, Reitsma, Shields, Graves, Affeldt, and Halama. However, even if the S/R gap were only about 0.60, this would have very significant implications for valuing relievers.

Since we cannot assume that relievers and starters have equivalent ability, I think the best way to measure the S/R gap would be to analyze those pitchers who pitched a significant number of innings in both roles over a given period of time (say three seasons) for the same team. This effectively controls for park effects and team defense. Within such a sample of dual-use pitchers, we could then measure the average performance gap between starting and relieving, using component ERA or some similar metric. We might choose to omit very young pitchers and those at the very end of their career, as their talent level might change dramatically over the course of a three-year period, skewing the comparison if they happened to pitch predominantly in one role while clearly a better pitcher. I’m convinced that such a study would demonstrate that relievers have an inherent advantage over starters of at least 0.60 in ERA, perhaps more.
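To make the proposed study concrete, here is a rough sketch of the calculation. The pitchers, innings, and earned run totals below are invented, and the 40 IP cutoff is an assumption for illustration; this is not Tangotiger's actual method.

```python
# Rough sketch of the proposed dual-use study. All pitchers, innings, and earned
# run totals below are invented; the 40 IP cutoff and the 3-year window are
# illustrative assumptions only.

def era(ip, er):
    return 9.0 * er / ip

# (pitcher, team, window) -> {"start": (IP, ER), "relief": (IP, ER)}
splits = {
    ("Pitcher A", "Team X", "2001-03"): {"start": (180.0, 88), "relief": (60.0, 25)},
    ("Pitcher B", "Team Y", "2001-03"): {"start": (120.0, 64), "relief": (85.0, 40)},
    ("Pitcher C", "Team Z", "2001-03"): {"start": (45.0, 26),  "relief": (140.0, 70)},
}

MIN_IP = 40.0
gap_sum, weight_sum = 0.0, 0.0

for roles in splits.values():
    s_ip, s_er = roles["start"]
    r_ip, r_er = roles["relief"]
    if s_ip >= MIN_IP and r_ip >= MIN_IP:
        gap = era(s_ip, s_er) - era(r_ip, r_er)   # positive = better in relief
        weight = min(s_ip, r_ip)                   # weight by the smaller role sample
        gap_sum += gap * weight
        weight_sum += weight

print("Weighted average S/R gap: %.2f runs of ERA" % (gap_sum / weight_sum))
```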

One could argue that such an approach creates the potential for "selection bias," because many dual-use pitchers have experience in both roles precisely because they failed as starters (but had they failed at both, they wouldn’t stay in the majors at all). If only some pitchers perform better in relief, while many others perform as well or better as starters, a study that disproportionately examined the former group would overstate the S/R gap. However, I see this as more of a theoretical than practical problem.

A study of dual-use pitchers would omit two groups of pitchers:

1) Starters only. Is there some reason to think they, unlike dual-use pitchers, wouldn't also perform better as relievers? Based on postseason experience, when quality starters are sometimes called upon to relieve, it doesn't look that way. More importantly, the many examples of starters who begin failing late in their careers, only to find success for a few more seasons in the bullpen, provide very strong evidence that switching to the reliever role would improve the performance of most pitchers for whom we have data only in the starting role.

2) Relievers only. Could some significant number of relievers actually be just as good as (or better than) starters? This is easier – the answer clearly is no. If these guys could maintain ERAs anywhere near their current level as starters, they would be in the rotation. In fact, this creates the opposite kind of selection bias -- there are probably many relievers for whom the S/R gap would be even larger than for dual-use pitchers, but their inability to start was established in the minors.

On balance, it seems to me the risk of serious selection bias is not that large, and much smaller than the clear error introduced by measuring starters and relievers against a single composite standard. In any case, I hope some analysts with good databases will take up the challenge of measuring the S/R gap, through this and perhaps other approaches.

Why do relievers enjoy such an advantage over starters? Clearly, a pitcher who knows he only has to pitch one or two innings can afford to throw harder, and this leads to more strikeouts. NL starters averaged 6.39 K/9 in 2003, while NL relievers averaged 7.16 K/9, 12% higher. Relievers also yielded fewer HR/9 (1.00 vs. 1.09) and posted a somewhat better hit rate on balls in play (.278 vs. .283). Interestingly, relievers allowed more BB/9 than starters, 3.78 vs. 3.24 – consistent with the idea that pitchers throw harder in relief (and with the premise that relievers are less talented in general). However, these aggregate comparisons provide at best a partial answer, since we don’t know how much of the difference results from the role and how much from underlying differences in ability. A study comparing pitchers against themselves in the two pitching roles would also allow us to better understand the nature of the reliever’s advantage.
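For reference, here is how the rates quoted above are computed. The component totals below are invented; only the formulas (K/9, BB/9, HR/9, and the hit rate on balls in play) are the point.

```python
# How the rates above are computed. The component totals are invented; only the
# formulas matter.

def per9(events, ip):
    return 9.0 * events / ip

def babip(h, hr, ab, k, sf):
    # hits other than home runs, divided by balls put in play
    return float(h - hr) / (ab - k - hr + sf)

ip, k, bb, hr, h, ab, sf = 14500.0, 11500, 6100, 1610, 13380, 55000, 440  # invented totals

print("K/9   = %.2f" % per9(k, ip))
print("BB/9  = %.2f" % per9(bb, ip))
print("HR/9  = %.2f" % per9(hr, ip))
print("BABIP = %.3f" % babip(h, hr, ab, k, sf))
```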

Establishing Separate Standards

To properly value the contribution of starters and relievers, we need to establish separate benchmarks for each. The required adjustment will of course vary depending on the valuation system. To take WS as an example, the main change would be in expected runs allowed. To simplify, assume a park-adjusted average of 4 runs allowed per 9 IP, that starters account for 2/3 of all IP, and that relievers have an inherent advantage of .6 R/9. Since the league average is a 2/3-1/3 blend of the two roles, the starter benchmark sits one-third of the gap above the league figure: 4.20 for starters (or 6.384, using the 1.52 multiplier) and 3.60 (5.472) for relievers. In WS, the same adjustment would be required for assigning claim points to relievers based on performance in high leverage innings.
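To make the arithmetic explicit, here is the same calculation as a short sketch, using the assumed numbers above (a 4.00 league RA/9, a 2/3-1/3 innings split, a 0.6 run reliever advantage, and the 1.52 WS multiplier):

```python
# Splitting one league-average benchmark into separate starter and reliever
# benchmarks, using the assumed numbers above.

LEAGUE_RA9    = 4.00
STARTER_SHARE = 2.0 / 3.0
RELIEVER_EDGE = 0.60
WS_MULTIPLIER = 1.52

# league = share * S + (1 - share) * (S - edge)  =>  S = league + (1 - share) * edge
starter_bench  = LEAGUE_RA9 + (1.0 - STARTER_SHARE) * RELIEVER_EDGE
reliever_bench = starter_bench - RELIEVER_EDGE

print("Starter benchmark : %.2f  (x1.52 = %.3f)" % (starter_bench, starter_bench * WS_MULTIPLIER))
print("Reliever benchmark: %.2f  (x1.52 = %.3f)" % (reliever_bench, reliever_bench * WS_MULTIPLIER))
# -> 4.20 (6.384) for starters and 3.60 (5.472) for relievers
```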

Needless to say, holding relievers and starters to such different performance standards would result in significant reductions in relievers’ assessed values, while raising the value of starters.

Having separate performance standards will doubtless strike some as unfair, but the reverse is actually true. Why should we compare Armando Benitez’s ERA to a league average driven mainly by starters’ performance, when we know that Benitez could not possibly perform as well over 7 innings, and – more importantly – that the average starter could come much closer to Benitez’s ERA if asked only to face 4 batters at a time? In valuing players we are asking this question: How does this player compare to what an average player (or replacement player) would do in the same situation? We can only do this by developing separate benchmarks for relievers and starters.

Fundamentally, Win Shares and other valuation systems are failing to account for the fact that the quantity of innings pitched – beyond the quality of those innings – has real value for a team. When a GM or manager says “we need a guy we can count on to give us innings” to justify paying $4-5 million/year to a guy with a mediocre ERA, many analysts respond with skepticism if not derision. But in fact, teams do need the innings, so the ability to pitch at a given performance level for 6+ innings is much more valuable – it helps a team win more games – than the same performance level in 1-2 inning increments. This is not captured in valuation metrics that treat all IP the same.

Consider two pitchers. Kerry Ligtenberg has a career OPS against of .656, and averages about 60 IP/year, while Curt Schilling has a nearly identical .652 OPS while averaging around 200 IP. Would anyone argue that Curt Schilling's contribution is only 3.3 times as valuable as Ligtenberg's? The salary marketplace clearly says no, and in this case it is right. (Interestingly, if teams were allowed to carry 20 pitchers, baseball might be played differently and perhaps a Ligtenberg would be worth 1/3 of a Schilling. But in the real game, a Schilling is immensely valuable while the Ligtenbergs are interchangeable parts, bouncing from team to team.)

Finally, I want to be clear that adjusting for the S/R gap does not move a valuation system from evaluating pitchers’ performance to ability. We need to make this adjustment to evaluate performance in a meaningful way, because two pitchers of identical ability will perform differently in these two roles. Paradoxically, we need separate standards rather than a single standard to create a truly level playing field for starters and relievers.
--posted by TangoTiger at 11:32 AM EDT


Posted 1:35 p.m., December 27, 2003 (#2) - MGL
  Good discussion and of course, any system that treats both types of pitchers the same (i.e., uses the same baseline ERA to compare to) is indeed problematic. A few comments:

One, I think you understate the selective sampling problem in doing a study of pitchers who have substantial relief AND starting work. It can be worked with (adjusted for) to some extent, but it is a big problem nonetheless.

You say that Tango comes up with a difference of around .60 runs after doing his study. You then map out a plan for a reasonable study and state that if you did such a study, you would probably come out with a difference of more than .60 runs. As far as I know, Tango came up with that .60 by doing precisely what you map out as the correct way to do such a study!

The reason you can keep giving anecdotal (read: worthless) evidence of large spreads in ERA between relievers and starters is that relievers pitch between 70-100 innings and starters around twice that. Of course, the lowest ERA's for relievers will be lower than the lowest ERA's for starters over the same time period, even if they had exactly the same talent (true ERA's)!

Your article is good, but please, please (pretty please), DO NOT use anecdotal evidence to "prove" a point or even to support or contradict a hypothesis or a notion or someone else's conclusions! That is wrong and a pet peeve of mine! Unfortunately, many writers do that all the time in the context of shoddy research! In fact, the word "evidence" in the term "anecdotal evidence" is a misnomer! If it were evidence (beyond a scintilla or a de minimis value), it would not be called anecdotal. When we provide data from thousands or hundreds of samples, we do not call them "anecdotes"! When we provide one or two (or three or ten) data points, we often call them "anecdotes" (and rightfully so), and of course they have little or no evidentiary value because of the sample size. Anecdotes should only be presented to ILLUSTRATE a contention or a conclusion that is, or is going to be, proven or at least investigated by the proper scientific method! I really don't want to hear about Smoltz or Gagne as "proof" or evidence of anything!

Anyway, very good central point in your article!

BTW, if a pitcher is able to pitch very well for 1 or 2 innings, but he would suck for 5 or 6 or 7 innings for whatever reasons (not enough pitches, stamina, etc.), is that pitcher more or less talented overall than the average starting pitcher? I ask that because you throw out terms like "more or less talented" with regard to starting and relief pitchers without including the proper context (e.g., more or less talented for how many innings?)...

Posted 2:41 p.m., December 27, 2003 (#3) - tangotiger
  Excellent article, and thoughts. This is the kind of article I like to read, even if it has no accompanying research.

Posted 4:08 p.m., December 27, 2003 (#4) - kamatoa
  Very interesting argument, but I'm not convinced that relievers and starters should be evaluated separately for the simple reason that a single run given up in a game has the same weight regardless of whether a starter or a reliever allows it.

Because of that, Win Shares and other systems that use a league average benchmark probably are valuing relievers' and starters' contributions toward team wins accurately - it is simply that these systems inherently account for the relative ease of relieving by giving relievers a relatively greater per-inning score than starters. It would be foolhardy to argue that this difference arises from relievers' greater talent - thus, this difference must arise from the greater difficulty of starting.

To its credit, Win Shares does not indicate that Ligtenberg is more valuable than Schilling - it does, however, note that Ligtenberg contributed to his teams' wins by 1) preventing runs that might have scored and 2) causing outs - the same tasks required of a starter. In addition, many of Ligtenberg's innings are likely to be "high-leverage" situations; thus, having a Schilling-like reliever on the mound for an inning or two is likely to contribute a great deal to a team's win - perhaps more than a starter's average inning. On the other hand, Ligtenberg's Win Shares can be compared with those of other middle relievers, and Schilling's with those of other starters, to determine which pitcher excelled within their respective roles.

Finally, and perhaps fatally for the argument in the article, the sample suggested for study (pitchers who have pitched a significant number of innings as both starters and relievers over three seasons, are not young, are not old, and who have stayed with the same team for the entire duration) is very non-representative of the population of major league pitchers. Frankly, this suggests a study based on failed starters who managed to hang on in the majors via bullpen success. Although Guy notes that this might introduce bias, I think he seriously underestimates the impact that such a non-representative sample will have on the data - most pitchers simply do not fit into this category and the ones who do very likely differ in important ways from those who don't.

In addition, this kind of study ignores the fact that a very large number of pitchers likely fit into their current roles "by accident." It is unclear how many relief pitchers would indeed be successful starters if they had an opportunity and had somehow maintained their readiness for that role. Likewise, it is unclear how many failed starters were not given a real chance to pitch out of the bullpen in any case (think Jim Parque). Finally, the entire sample Guy proposed would likely be so small as to call into question any results at all.

So what would be the best tactic to use to determine starters' and relievers' actual usefulness? I would favor the methods used by James and others who use regression analyses to determine the relative contributions of individual players to their teams' performance (i.e., wins, runs allowed, etc.). Although these methods may not necessarily determine which pitcher has more talent, they would determine which pitchers successfully helped their team reach the goal of winning games, regardless of their respective roles on the team. These methods also avoid the issues of sample size and sample bias that would sink a study like the one proposed here.

Posted 7:26 p.m., December 27, 2003 (#5) - David Smyth
  So, Kamatoa, you are advocating a system like ALP? (Even though you've probably never heard of it.)

I recall an old study by P Birnbaum which showed that the "reliever's ERA advantage", as a function of their role and not their talent, is about .2 runs. If the actual is .6 runs+ instead of .2, that says much about the "ability" of these pitchers.

Still, value is what it is. If you want to determine Gagne's actual impact in 2003, and whether he deserved the Cy Young, you might want to check out the relative ALP. This method is "almost" perfect.

Posted 9:01 p.m., December 27, 2003 (#7) - kamatoa
  David -

ALP sounds like an interesting system (you're right - I can't say I've heard of it). Where could I learn more?

Posted 11:34 p.m., December 27, 2003 (#8) - tangotiger
  a single run given up in a game has the same weight regardless of whether a starter or a reliever allows it.

So? Don't you value an out from a SS differently than one from a 1B? Don't you value the hitting performance of a pitcher differently than that of a RF?

What we are talking about here is the appropriate baseline for comparison purposes.

The specific question being asked is: "How would a MLB average pitcher do, if he had pitched to those 6 batters in relief?" or "How would a MLB average pitcher do, if he had pitched to those 25 batters to start the game?" Because of the very specific question, you are automatically going to get different baseline standards to compare against.

At this point, I'm not even sure what an average MLB pitcher is, because of this starter/reliever issue. If you take all pitchers, and weight them by their actual PAs, but assume they all started, you might get a 5.00 ERA. If you assume they all relieved, you might get a 4.40 ERA. If you then weight them at 2/3, 1/3, you'll get 4.80 ERA. But, the actual MLB ERA might be 4.70 because of the way the pitchers are used individually.

(Numbers for illustration purposes only.)

Posted 12:36 a.m., December 28, 2003 (#9) - Ken Arneson (homepage)
  I recall an old study by P Birnbaum which showed that the "reliever's ERA advantage", as a function of their role and not their talent, is about .2 runs.

What exactly is meant by a "function of their role"? Is it simply from the fact that relievers sometimes come into innings with one or two outs, while starters never do?

I mean, suppose we have three pitchers in one inning: each pitches 1/3 inning, and each yields a triple, and then gets a strikeout. The first two pitchers end up with an ERA of 27.00, the last one gets an ERA of 0.00, even though they all gave up the exact same sequence of events.

Is that what the .2 difference comes from?
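Here is a little sketch of the ERA bookkeeping in that hypothetical inning, just to make the runner-charging rule concrete (the only thing the code models is "the run is charged to the pitcher who put the runner on"):

```python
# Hypothetical inning: three pitchers, each allows a triple, then records a
# strikeout. Earned runs are charged to the pitcher who put the runner on base.

def score_inning(events):
    """events: list of (pitcher, play), where play is 'triple' or 'strikeout'."""
    earned = {}           # pitcher -> earned runs charged
    outs = {}             # pitcher -> outs recorded
    runner_owner = None   # pitcher responsible for the runner on third
    for pitcher, play in events:
        earned.setdefault(pitcher, 0)
        outs.setdefault(pitcher, 0)
        if play == "triple":
            if runner_owner is not None:   # the runner already on third scores
                earned[runner_owner] += 1
            runner_owner = pitcher         # the new runner belongs to this pitcher
        else:                              # strikeout
            outs[pitcher] += 1
    return earned, outs

inning = [("P1", "triple"), ("P1", "strikeout"),
          ("P2", "triple"), ("P2", "strikeout"),
          ("P3", "triple"), ("P3", "strikeout")]

earned, outs = score_inning(inning)
for p in ("P1", "P2", "P3"):
    ip = outs[p] / 3.0
    print("%s: %d out, %d ER, ERA %.2f" % (p, outs[p], earned[p], 9 * earned[p] / ip))
# -> P1 27.00, P2 27.00, P3 0.00, from identical sequences of events
```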

Posted 2:41 a.m., December 28, 2003 (#10) - Charles Saeger(e-mail)
  Very pertinent points made all around.

Posted 9:41 a.m., December 28, 2003 (#11) - studes (homepage)
  Excellent article. I agree with Guy's point, though I also agree with MGL. I cannot think of a way to formulate a study that would not include a significant selection bias.

David, no fair whetting our appetite. What's ALP?

Posted 10:08 a.m., December 28, 2003 (#12) - David Smyth
  Studes, it's "Absolute Losses Produced", a companion to AWP (if you can now guess what AWP means, you're a genius). Tango has a link to a fanhome thread somewhere on Primate Studies, I think. I'm not trying to start a discussion on it here. It just struck me that some of Krackatoa's post sounded like what I tried to do there.

Posted 12:48 p.m., December 28, 2003 (#13) - studes (homepage)
  Sorry, David. I should have guessed that.

Posted 2:47 p.m., December 28, 2003 (#14) - Jim R
  Interesting thoughts, and I have a question. At some point, does it not make sense to vary the utilization patterns of starters as well as relievers? I'm making a lot of assumptions about Tango's unpublished study, but suppose the advantage a reliever gets is based on two factors:
(1) Number of times through the lineup
(2) Amount of exertion on a single pitch

If we can decouple these factors, wouldn't we know how to maximize starter utilization? For instance, I once read a Mike Marshall-linked article where he opines that Maddux will take himself out of games after 3 times through the lineup. That certainly accounts for factor 1, but not necessarily factor 2. If a pitcher does do this, shouldn't he be able to recoup some of the 0.60 advantage in ERA? Could he recoup more if we limit it to two times through the lineup? If we start producing and positing some of these limits, wouldn't a pitcher then be able to also start to increase exertion to recoup a bit more?
Obviously this changes the makeup of your personnel and roster in a drastically different way. However, even if not taken to its logical conclusion, shouldn't we be able to look at an individual group of personnel, determine the equilibrium points of usage, and over the course of the season reduce the staff ERA by a significant amount?

Posted 3:27 p.m., December 28, 2003 (#15) - MGL
  IIRC, Tango found no evidence that "times through the order" is a significant factor for anyone (starters or starters/relievers). Therefore the difference between a hybrid pitcher's stats when relieving versus starting is probably due to throwing harder, or at least differently, for 1 or 2 innings when he relieves, but pacing himself from the get-go when he starts.

Or, because of selective sampling, it could just be that pitchers who have substantial time as both starters and relievers have NO true difference between starting and relieving, but it is just that when a pitcher sucks at starting, he may get demoted to the bullpen, and when he pitches well in relief he may get promoted to the rotation. This could easily create the illusion that they are actually better pitchers when they relieve. Here is the proof that this could easily be the case:

Assume exactly the same talent when relieving and when starting. Assume they are exactly average pitchers and they are all the same (100 ERA+). 100 pitchers start out as pure starters. 5 will suck to the tune of 1.5 SD's below average. Those pitchers will get demoted to the bullpen. In the bullpen they will have an ERA+ of 100. As starters they will have had an ERA+ of something less than 100. Get it?
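Here is a quick simulation sketch of that selection effect. Every number in it is invented (the 4.50 true RA/9, the 100 IP starting trial, the 60 relief IP, and the demotion cutoff are illustrative assumptions only):

```python
# Identical true talent, selective demotion, and the resulting illusion.
import random
random.seed(1)

TRUE_RA9 = 4.50
START_IP, RELIEF_IP = 100, 60
DEMOTION_LINE = 5.25          # a starter who looks this bad over 100 IP gets demoted
N_PITCHERS = 10000

def observed_ra9(true_ra9, ip):
    # crude luck model: each of 3 "chances" per inning scores with fixed probability
    p = true_ra9 / 27.0
    runs = sum(1 for _ in range(ip * 3) if random.random() < p)
    return 9.0 * runs / ip

start_ra, relief_ra = [], []
for _ in range(N_PITCHERS):
    as_starter = observed_ra9(TRUE_RA9, START_IP)
    if as_starter > DEMOTION_LINE:                 # "he sucked; send him to the pen"
        start_ra.append(as_starter)
        relief_ra.append(observed_ra9(TRUE_RA9, RELIEF_IP))

print("Demoted pitchers as starters: %.2f RA/9" % (sum(start_ra) / len(start_ra)))
print("Same pitchers in relief     : %.2f RA/9" % (sum(relief_ra) / len(relief_ra)))
# Identical true talent, yet the demoted group looks much "better" in relief,
# purely because of how they were selected.
```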

Given the way some pitchers get shuffled between the pen and the rotation regardless of their true talent (e.g., Weaver), you have all kinds of SEVERE selective sampling issues when looking at all pitchers who have pitched in the pen AND in the rotation. In fact, because of the reasons why some pitchers get switched to the pen or get promoted to a starting role, your conclusion when looking at their relative performances is going to be foregone and will NOT give you much insight into their true talents when starting and when relieving! It is exactly like trying to come up with MLE's that apply to any player in the minors - very difficult, even harder for this pitching thing, as you will be hard pressed to find many pitchers who pitch well as starters and then get "demoted" to the pen...

Posted 3:31 p.m., December 28, 2003 (#16) - MGL
  The more that I think of it, the more that I think it is worthless to compare performances by the same pitcher in both relief and starting roles. Completely worthless! In fact, if it is true that "times thru the order" is not that relevant, that is evidence that maybe there is no significant difference between when a pitcher starts and when he relieves, at least for those pitchers who can and do both (as opposed to "specialty" pitchers who can throw 100 MPH for one inning, like Wagner)...

Posted 4:03 p.m., December 28, 2003 (#17) - tangotiger
  To confirm MGL: yes, I did find that, looking at "times through the order", a pitcher, be it a career starter, career reliever, or something in between, performed equally well each time through the order (NO dropoff whatsoever). So, knowing you are going to start, you can pace yourself.

However, it's certainly possible that a Billy Wagner has no reason to pace himself, and decides to juice it up 2 MPH per pitch. Why not? The difference in starting/relieving can simply be traced down to effort exerted. This may also be the reason that "leveraged innings" for relievers are a lot less than for starters. That is, a starter will have an LI of 1.0 for 230 innings, or 230 leveraged innings, pitching at 95% of maximum. A reliever will have an LI of 2.0 for 90 innings, or 180 leveraged innings, pitching at 99% of maximum.

It could very well be that effort x Leverage x Innings works out the same for both starters and (top) relievers.

Posted 1:00 a.m., December 29, 2003 (#18) - Guy
  As a new contributor, I want to thank MGL for his warm welcome to Primate Studies. Seriously, thanks for the (generally) thoughtful comments. Three basic criticisms have been offered of my contention that relievers have an inherent advantage over starters:
1) Even if it is true, it doesn’t matter for valuing pitchers’ performance,
2) It isn’t true, and any observed difference is likely a function of small sample sizes for relievers and/or selection bias in which pitchers assume the two roles; and
3) It probably is true, but there is no way to accurately measure it.
Let me address each one.

IT DOESN’T MATTER
Tango has already addressed this issue, so I don’t have a lot to add. Personally, I am skeptical of systems that purport to assign absolute value to players without any reference either to average performance or replacement-level performance. This is one of the conceits of Win Shares, but I don’t think it really succeeds, and I’m not even sure such a system is possible, or desirable. But in any case, most valuation systems DO compare players’ performance to some kind of benchmark, and this includes important elements of the Win Shares calculation for pitchers. So I think the reliever advantage, if it exists, clearly does have important ramifications for many player valuation systems.

IT ISN’T TRUE.
Two simple facts provide overwhelming evidence of a reliever advantage:
1) The large majority of relief innings are thrown by pitchers who have failed to succeed as starters;
2) Despite fact #1, in 2003 relievers posted a collective ERA 0.38 lower than starters.
The average team had 490 relief innings, and the average closer threw about 68 of these, or 14%. Even if we generously assume that all closers are very talented pitchers (the list includes Jose Mesa, Rocky Biddle, and Mike Williams), 86% of all relief innings were hurled by failed starters. The fact that relievers nonetheless perform better than starters on a per inning basis – collectively, not “anecdotally” – means that the reliever role on average confers a powerful advantage.

MGL hypothesizes that many pitchers get demoted to the bullpen on the basis of a small IP sample, and are then kept there despite the fact they could pitch just as well in a starting role. To begin with, this ignores the fact that a pitcher’s entire performance record, including the minors and spring training, informs decisions about who starts and relieves. And think about what this theory means. There are scores of relievers who regularly post ERAs better than the average starter (4.54), and at least 25 teams each year must have a couple of relievers -- excluding their closer -- with ERAs better than the team’s worst starter. We’re supposed to believe that as 30 teams desperately try to find decent #4 and #5 starters, they continually overlook pitchers on their own roster who could successfully start? I recognize that GMs and managers make a lot of mistakes, but this is preposterous. If most decent middle relievers could pitch just as well as starters, many would be given the chance, and many of those would succeed (and the starter/reliever gap would shrink in my proposed dual-use sample).

MGL also suggests that relievers will naturally post many of the best ERAs, because they pitch only 75-100 IP while starters pitch twice that. This is an interesting theory, but a quick review of actual data shows it has little relevance here. Looking at pitchers who logged at least 50 IP last year, among those who had an ERA of 2.50 or lower, 18 were relievers and 4 were starters, while among those with an ERA of 6.00 or higher, just 8 were relievers and 17 were starters (and a handful were hybrids). The predominance of relievers in the list of low-ERA pitchers obviously reflects the fact that the entire curve is shifted for relievers, not sampling error. Clearly, the threshold for continued employment in MLB is higher (i.e. a lower ERA) for relievers than for starters – as it should be.

Given Tango’s findings on times-through-the-lineup, the clear source of the reliever advantage is throwing harder, resulting in far more strikeouts and, perhaps, a lower hit% on balls in play. Last year, relievers struck out 1 more batter per 9 IP than starters (7.07 vs. 6.04), a 17% advantage. If we corrected for relievers’ lesser talent, I would guess that the K advantage in relieving would easily exceed 20%. (I couldn’t find PA totals for starters and relievers, but the K/PA gap is probably even greater).

Perform this thought experiment: imagine that all relievers are required to start 4 games in 2004, and all starters forced to pitch 30 innings in relief. Can any knowledgeable fan doubt that the former would be a disaster, or that the latter would, generally speaking, be a success? At a minimum, we should assume that starters are just as talented as relievers, which creates a presumption that the relief role delivers an advantage of at least 0.30 in ERA, and recognize that the real number is almost certainly higher.

IT CAN’T BE MEASURED.
Several people have suggested that my proposed methodology for measuring the reliever advantage, by comparing the performance of pitchers who have pitched in both roles, is flawed. Sample size would not be a problem, if one assembled data from the past ten years or so. The more serious objection is the potential for selection bias. It is not true that the dual-use sample would comprise only pitchers demoted to the bullpen after failing in only a handful of starts. It would include quite a mix, including solid relievers asked to make some spot starts, pitchers who have performed reasonably well in both roles (M. Batista), good starters who became closers (Smoltz), and a variety of other combinations. Still, we can’t be sure that dual-use pitchers are the same as those who have pitched only in one role.

Let me say first that I hope others will come forward with other suggested approaches for measuring the reliever advantage. This is the best I came up with, but I’d be happy to hear better ideas. And I hope Tango will share his findings soon (I reported his conclusion, but do not know what methodology he used).

That said, I still think such a comparison is valuable, if imperfect. Let’s break down what I’ve been calling the reliever advantage into two discrete, and potentially different, numbers: 1) Reliever Decline -- how much worse the average reliever would perform if forced to start, and 2) Starter Improvement -- how much better the average starter would perform in relief. My proposed study would almost certainly understate the magnitude of reliever decline. After all, for a pitcher to become dual-use, his manager had to have some reason to think he could succeed in the starter role at the time. For relievers who have never been used as starters -- despite posting better ERAs than most starters! -- it is fair to assume that they would see at least as large a decline in performance as the dual-use group, and probably even larger.

Moreover, most relievers probably had some experience starting in the minor leagues. If someone can demonstrate that current ML relievers performed just as well in the starting role while in the minors as did pitchers who went on to succeed as ML starters, I’ll reconsider my position. But I will be shocked if that is the case, for the simple reason that teams desperately want to find quality starters, and would give successful minor league starters a reasonable opportunity to succeed before relegating them permanently to the pen.

So, I think a study of dual-use pitchers would, at least, provide a conservative estimate of reliever decline.

The much harder question is: Would current starters improve their performance by the same amount that relievers decline, if asked to pitch in one or two inning increments? It could be true that the gap would be smaller going in this direction – that what distinguishes starters is their ability to maintain a high performance level for 6 or more innings, but they cannot improve greatly even if asked to pitch fewer innings. And if true, a study only of dual-use pitchers would miss this.

To this objection, I would say two things:

First, that the burden of proof should fall on those who want to argue that starters are a totally different animal, immune from the difference found elsewhere. We have observed that relievers post better ERAs, even though we have good reason to think they are lesser talents. Among pitchers who have pitched in both roles, we expect to find (and maybe Tango has?) that they perform better as relievers. Given that, the burden of proof would shift to those who believe starters who have never pitched in relief form a separate class.

Second, and more importantly, the fact that our best estimate of the reliever advantage may be flawed is not a reason to ignore this difference. If the best evidence we have suggests relievers have an inherent advantage of .6 runs per game, that is very significant. Should we ignore that because we can’t totally rule out the possibility that it is really .45 (or .7)? Should we keep using a single benchmark for all pitchers when we know that makes no sense, because we aren’t sure exactly how different the benchmarks should be? To my mind, this is making the perfect the enemy of the good.

Finally, there are good reasons beyond the issue of player valuation to study and try to measure the reliever advantage as best we can, as well as which types of pitchers enjoy greater or lesser advantages in the relief role. Because if it exists at any level close to .6 runs/game, then it probably has major implications for the development of the game. Such an advantage would mean that a substantially below average starter will generally become an average pitcher in relief, an average pitcher will become a very good one, and a good pitcher will become excellent. And if that is true, we would expect a number of things to happen:

Teams would want to use relievers as much as they could, except when starting the very best starters;
Teams would move in the direction of carrying as many pitchers as they could, within the constraint of the 24-man roster (i.e. until the cost of losing another position player was too great);
If pitchers can throw harder when pitching fewer innings, then having starters pitch every 5th day rather than every 4th may improve their performance, as will having them pitch fewer innings per start;
Strikeouts will increase to the extent that teams pursue some or all of these strategies.

As it happens, this is a pretty good description of changes in pitching use over the past twenty five years or so. In 1968, starters accounted for 75% to 80% of IP for most teams, while today it's 67%. Five man rotations are now universal. Strikeout rates are up dramatically.

I’m not suggesting that the reliever advantage is the sole, or even primary, reason for these changes. But it could well have played an important role in the evolution of pitching staff management. This would be true even if managers and GMs didn’t consciously recognize the reliever advantage concept – simple trial and error response to the increasing advantages held by the offense over the past two decades would have pushed teams in this direction.

Thanks again for the comments. And I do hope others will contribute their thoughts about better ways to measure and understand this issue.

It is true that I take starting pitching to in some sense be the norm, when I say that someone who can pitch well for 7 innings is “more talented” than someone who can perform the same for only 1 or 2 innings. Clearly, the former is more valuable than the latter (assuming a performance above replacement level). I think it’s also reasonable to call that superior talent – certainly, most pitchers start their careers hoping to start – but I can see the contrary view.

Posted 1:04 a.m., December 29, 2003 (#19) - Guy
  Sorry -- last paragraph there was included by mistake.

Posted 4:31 a.m., December 29, 2003 (#20) - MGL
  A thoughtful discussion, Guy. I have to re-read your original article, because I forgot your contention and its ramifications. For one, if you are saying that when a starter switches roles he will, on average, improve his per-inning performance, I certainly don't disagree. Did you think I did? We already discussed, and we are all in agreement, that a pitcher can choose to throw harder knowing that he is only going to pitch for 1, 2 or 3 innings, rather than potentially 6 or 7. While some may not be able to do this for whatever reasons, or their style of pitching is not conducive to "throwing harder" (I can't imagine Maddux, for example, throwing harder or altering his style of pitching when pitching in a relief role, but you never know), I think it is fair to say that on the average a starter will pitch better in relief (i.e., some will pitch better, some will pitch the same, and maybe a few will pitch worse - those who take a while to "warm up," if there really is such a thing).

We might have thought that pitching only 1, 2 or 3 innings would be an inherent advantage over pitching more innings, as we might have surmised that the more times a batter sees a pitcher, the more of an advantage he has. That appears NOT to be the case, based on Tango's research, but you never know. It could certainly be that seeing a pitcher for the second time IS an advantage for a batter, but that advantage is eliminated because, on the average, a pitcher increases his talent slightly after the first couple of innings. In either case, the net result is the same (same result first, second, and third times through the batting order), so it doesn't really matter why, I don't think. So again, we are left with the fact that a pitcher (or at least some pitchers) can and do alter their pitching when they know they are only out there for a short time, in such a way that their per-batter or per-inning "talent" is better than if they anticipated being out there for a long time, perhaps. Makes sense. Probably true. Now, verifying that hypothesis and trying to quantify it is the real tricky part.

Guy, I think you still don't understand the magnitude of the selective sampling problem when you are dealing with dual-use pitchers. It is extremely problematic. It doesn't take a whole lot of pitchers who get bombed or pitch poorly as starters getting demoted to the bullpen to screw up the randomness and independence of your reliever and starter ERA samples for the dual-use pitchers. This can easily be illustrated by doing a simulation on the computer.

The other thing I said is that if a group of pitchers (relief pitchers) has smaller average samples, their best and worst ERA's per season will ALWAYS be more extreme than those of a group (starters) that has a larger average sample size (innings), assuming that one group is not much more or less talented than the other. This is a mathematical certainty of course. If we took a group of 100 pitchers and threw them 3 innings each for the whole season, and another group of 100 and threw them for 200 innings for a whole season, which group would boast the most pitchers with the most extreme ERA's? It is a no-brainer, right! Several of the 3-inning pitchers will have ERA's of 0.00. Several will have ERA's over 10, etc. Of the 200-inning pitchers, none will likely have ERA's of 0.00 or even 1.00, and none would likely have sky-high ERA's. Same thing, but less extreme results, for a group of pitchers who throw 75-100 innings per year (relievers) versus a group of pitchers who throw 150-200 innings per season (starters). Of course, if you complicate things by throwing pitchers who pitch in only a few innings into the starter group, then the results are going to change a little. I am not saying that relievers will have more extreme ERA's than starters (which they will) because they are relievers - it has nothing to do with that. I am just saying that any group of pitchers who don't pitch that many innings will have more extreme ERA's than a group that pitches a lot more innings, regardless of the relative talents of the two groups (assuming they are reasonably close in talent). It is the notion in statistics and gambling that in the short run "fluctuation trumps expectation," when expectation is expressed as a RATE, like ERA, but in the long run it is the reverse.
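A tiny sketch of that point: identical true talent in every group, only the workloads differ, and the short-stint group produces the extreme ERA's. Everything here is invented for illustration.

```python
# Same true talent, different workloads: smaller samples produce more extreme ERA's.
import random
random.seed(7)

TRUE_RA9 = 4.50

def season_ra9(ip):
    p = TRUE_RA9 / 27.0                      # chance a run scores on each of 3 "chances" per inning
    runs = sum(1 for _ in range(ip * 3) if random.random() < p)
    return 9.0 * runs / ip

for ip, label in [(3, "3 IP stints"), (80, "80 IP relievers"), (200, "200 IP starters")]:
    seasons = [season_ra9(ip) for _ in range(1000)]
    print("%-16s  lowest %.2f   highest %.2f" % (label, min(seasons), max(seasons)))
# Every group has a true 4.50; only the innings differ, yet the 3 IP group shows
# both 0.00's and double-digit ERA's while the 200 IP group stays comparatively bunched.
```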

That's all I have for now...

Posted 8:55 a.m., December 29, 2003 (#21) - David Smyth
  Very interesting topic. Although it is interesting to try to determine how starters would do in relief, and vice-versa, the fundamental way to handle this problem is simply to figure out what the proper baseline is for any comparison. And that would be the replacement level for innings by starters and innings by relievers. Simply look at the worst 10% (or whatever) of players who keep jobs, regress properly to handle the inning differences of starters and relievers, and voila. It doesn't matter whether relievers are failed starters or not, because when a reliever is replaced, his replacement will also relieve. This is certainly a "value" perspective in a topic with "ability" considerations, but the value perspective will get you 90+% of the way there.

Posted 9:26 a.m., December 29, 2003 (#22) - tangotiger
  The most important takeaway in Guy's last reply is that, overall, the league average ERA of a reliever is lower than that of a starter. Assuming that the component ERA is lower as well (and I only look at component ERA btw), then that by itself is enough to say that "yes, relievers have an advantage". Why? Because your best reliever is worse than your best starter. Your 2nd best reliever is worse than your 2nd best starter, and so on and so on.

I would guess that if you take out the best reliever (BEFORE THE FACT) and your two best starters (BEFORE THE FACT), THEN, I would say that the remaining relievers and remaining starters would be equals. Just a guess.

And MGL: Guy said that while the relievers' top ERAs were much better than the starters' top ERAs, the relievers' bottom ERAs were ALSO much better than the starters' bottom ERAs. Not sure if he used the appropriate IP cutoffs. In any case, this suggests that:
a) the spread is probably wider as expected, but
b) the distribution is shifted to one side by a large degree (otherwise you would have expected alot more bad reliever ERAs out there)

However, this last paragraph can be selective sampling. A manager might need (or accept) going to 100 innings with a starter before benching him, and might only accept 25 innings from a reliever at that level.

Posted 10:07 a.m., December 29, 2003 (#23) - kamatoa
  As others have mentioned, this is a very interesting discussion.

I think the most insurmountable problem in Guy's thesis is the sampling issue. However, I would definitely be willing to withhold judgment until seeing the results. If Tango is right when he says that there is no difference in performance based on times through the order, then it would be reasonable to suppose that any difference in the ERAs between starting and relieving in even this highly selective group of spot-starters, swingmen, demoted starters, and stretched-out relievers could be due to the relative ease of pitching from the bullpen. (Although it has just occurred to me that this relative ease of bullpen pitching might be partly due to managerial strategies that enable favorable pitcher-batter matchups - perhaps LaRussa has it right, strategerie could make a difference.)

Tango was somewhat critical of my contention that a single run has the same weight in a game regardless of who gave it up. Although I acknowledge the greater sabermetric expertise of others on the thread, I would simply reassert my belief that pitcher evaluation systems currently in use already measure the relative ease of bullpen pitching (which Guy convincingly argues), since bullpen pitchers receive higher per-inning evaluation scores than starters, despite these pitchers' relatively lesser talent. The marginal per-inning win share for a reliever over a starter probably represents the relative ease of relieving, to a significant degree.

A couple of 2003 cases can serve to illustrate (chosen because of their similarity in ERA):

Eddie Guardado: 65.3 IP; 2.89 ERA; 3-5; 41 SV; 15 Win Shares
Esteban Loaiza: 226.3 IP; 2.90 ERA; 21-9; 0 SV; 23 Win Shares

Both pitchers pitched well for their teams and contributed to team wins. Both were rightly considered among the best at their respective roles in 2003. However, Guardado's per-inning win share was .22 (almost exactly the same as Eckersley's MVP year of 1992), whereas Loaiza's Cy Young-contending per-inning win share was .10. Although Loaiza's success in 2003 may or may not have been a fluke (mastering a new "out pitch" may have greatly helped Loaiza in 2003), few would suggest that Guardado was more than twice as talented as Loaiza in 2003, inning for inning, especially given their similar ERAs. Since the discrepancy between Loaiza's and Guardado's scores is not likely due to their respective talent in 2003, it is probably somewhat due to the differential difficulty of pitching as a starter vs. a reliever - since making outs as a reliever is "easier," relievers progress teams more quickly toward wins, on an inning-for-inning basis, and thus earn more win shares due to their role alone.

For a more modest example, take these two pitchers' 2003 stats:
Tim Spooneybarger: 42 IP; 4.07 ERA; 1-2; 0 SV; 3 WS
Adam Eaton: 183 IP; 4.08 ERA; 9-12; 0 SV; 7 WS

Spooneybarger's .071 per inning win share is nearly double Eaton's .038 per inning rate, a difference not likely due to talent. Although Eaton is rightly acknowledged to have contributed more to his team's wins overall, Spooneybarger has the appearance of having contributed more than Eaton in any single given inning, despite their similar ERAs.

I know these examples are crude (they do not control for overall team performance and other variables that introduce variance into the win share system), but I think they suitably illustrate that systems like win shares already acknowledge the relative ease of relieving by inflating relievers' per-inning scores relative to starters who post the same ERAs over more innings.
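For reference, the per-inning rates in both comparisons are simply Win Shares divided by innings pitched (small rounding differences aside, since the fractional IP are really thirds of an inning):

```python
# Win shares per inning for the four stat lines above (IP treated as plain decimals;
# the thirds-of-an-inning rounding difference is negligible here).

lines = [("Guardado", 65.3, 15), ("Loaiza", 226.3, 23),
         ("Spooneybarger", 42.0, 3), ("Eaton", 183.0, 7)]

for name, ip, ws in lines:
    print("%-14s %.3f win shares per inning" % (name, ws / ip))
```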

Guy wrote: I think the reliever advantage, if it exists, clearly does have important ramifications for many player valuation systems. I agree with this statement, but comparing relievers and starters on one of these systems, Win Shares, suggests that the system may already acknowledge the benefit relievers receive from their role, regardless of their talent compared to starters.

Posted 10:13 a.m., December 29, 2003 (#24) - tangotiger
  since bullpen pitchers receive higher per-inning evaluation scores than starters

This is not true. The average reliever's Leveraged Index (LI) is 1.0, as it is for the average starter. Obviously, the Riveras and Hoffmans et al are at 1.7 or 1.8, while the Pedros and RJ are still at the 1.0 level.

The LI ONLY serves to establish the relative impact that the relievers have on the game state.

Relievers have it easier, just like catchers have it tougher. I would certainly advocate a different baseline comparison for catchers too.

Posted 10:53 a.m., December 29, 2003 (#25) - Guy
  I like David Smyth's suggested approach, both as a way of measuring the reliever advantage and because it addresses another important implication of the advantage: that there is no such thing as a single "replacement level pitcher," but rather replacement-level starters and replacement-level relievers. Depending on the distribution of the "durability talent," we could find that the gap is bigger (my guess) or smaller at the replacement level than overall.

Another interesting implication is the aging pattern for pitchers. If many starters pitch an increasing number of relief innings in their 30s, while few relievers do the reverse, then this will artificially flatten out the performance curve by age. That is, actual performance of starters may decline more quickly than it would appear by looking at their combined start/relief IP. And the curve may be quite different for starters than for relievers -- it may be easier to maintain performance with age in the relief role than it is as a starter. I would certainly guess that the average age of relievers (weighted by IP) is older than that of starters.

Posted 11:41 a.m., December 29, 2003 (#26) - ColinM
  Great discussion. This site is tough to keep up with if you go on vacation for a bit!

Got a bit of a problem with how you might set different replacement levels for starters and relievers. Say for example, Curt Schilling averages 8 innings a start. Is it right to just compare him to a replacement starter in order to find his value over replacement? There's no way that his hypothetical replacement would be expected to pitch 8 innings. So wouldn't his true replacement be something like 5 IP of starting pitching and 3 of relief?

Same thing might be a problem if you have real good middle relief (think Mark Eichhorn in '86). In a lot of cases the true replacement might have been a starter pitching an extra inning. The replacement line is fuzzy if you want to split it between starters and relievers because there is no rule for what inning to bring a reliever in.
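To put rough numbers on it, borrowing the article's illustrative 4.20/3.60 starter/reliever benchmarks as stand-ins for the two replacement levels, and the hypothetical 5 IP + 3 IP replacement for an 8-inning Schilling start (all of these figures are assumptions for illustration):

```python
# Composite replacement baseline for an 8 IP start, using assumed RA/9 benchmarks.

STARTER_BENCH, RELIEVER_BENCH = 4.20, 3.60   # illustrative benchmarks from the article

starter_only_baseline = STARTER_BENCH
composite_baseline    = (5 * STARTER_BENCH + 3 * RELIEVER_BENCH) / 8.0

print("Starter-only baseline: %.3f RA/9" % starter_only_baseline)
print("Composite baseline   : %.3f RA/9" % composite_baseline)
# The composite baseline is tougher (lower), so it trims some of the credit an
# 8-inning starter would otherwise get for his durability.
```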

Posted 12:42 p.m., December 29, 2003 (#27) - tangotiger
  Colin, great thought, and I was thinking of something similar.

Essentially, if Curt is averaging 30 BFP per game, then his replacement's performance should be judged against that. While the "performance in times-through-the-order" is fairly static, this really only applies to the pitchers who were able to pitch through the order 3 times. It's possible that a replacement-level pitcher just is not durable/cunning enough to go through 27 batters, and he might pay the price the third time through (yes, I will look into this as well).

However, I will disagree with part of this replacement talk. There IS a replacement-level TRUE TALENT pitcher. However, how this pitcher performs in context as a starter, or in context as a reliever, is in question. I'm not sure that taking the replacement-level reliever and replacement-level starter, from 2 separate pools, is a good way to equate it.

To take an example, you wouldn't take a replacement-level high school SS and compare that to a replacement-level high school 1B as a way to baseline the SS and 1B positions in high school. The pools of position players are very interchangeable in high school, and they are as well with the roles of MLB pitchers.

Posted 2:21 p.m., December 29, 2003 (#28) - Guy
  Perhaps thinking of two separate pools is not helpful. But clearly a better component ERA is required to hold a MLB job in relief than as a fifth starter, and I would expect the difference to be substantial. And when we evaluate starters in comparison to replacement level, the benchmark should be what a replacement-talent pitcher does in a starting role (not mixed with relief performance).

I may be misunderstanding Colin's suggestion, but it seems to me that comparing Schilling to a composite of 5 IP of replacement starter performance together with 3 IP of replacement relief performance would understate Schilling's value. His benchmark would then be tougher (a lower expected ERA) than the starter benchmark alone, in a sense penalizing him for his durability. This seems particularly misguided when we consider that relievers' performance is presumably enhanced by the days off they are periodically provided by 8-9 inning starts like Schilling's.

Posted 2:55 p.m., December 29, 2003 (#29) - tangotiger
  And when we evaluate starters in comparison to replacement level, the benchmark should be what a replacement-talent pitcher does in a starting role (not mixed with relief performance).

Agreed.

As for Colin's suggestion, this would assume, I think, that the replacement-level reliever is worse than the replacement-level starter (assuming two pools of pitchers).

a better component ERA is required to hold a MLB job in relief than as a fifth starter

Ok, so what do we have? In terms of TRUE TALENT, what would be our best guess as to the order:
1S, 2S, 1R, 3S, 4S, 2R, 5S, 3R, 4R, 5R, 6R
(where S=Starter, R=reliever)

Is this pretty much the "Pitching Spectrum"?

Posted 3:00 p.m., December 29, 2003 (#30) - studes (homepage)
  ...since making outs as a reliever is "easier," then relievers progress teams more quickly toward wins, on an inning-for-inning basis, and thus earn more win shares due to their role alone.

I think you may be mixing up skill and performance. Guardado and Loaiza performed similarly per inning (based on ERA), as did Spooneybarger and Eaton. Guy's point is that the starters of these two matched sets have a better set of skills/ability, and he feels that a value-based system ought to recognize that fact.

I also wanted to make a few Win Shares comments (if I may). The reasons relievers receive more Win Shares per inning are:

- The relative credit for wins and saves (which I've argued should be changed or dropped altogether because it favors relievers).
- The leveraged innings concept (Spooneybarger had 6 holds in 2003, which gave him some leveraged innings).
- The Component ERA (ERC) adjustment. Spooneybarger was also helped by this, as his base stats were very good (42 IP, 1 HR, 32 K's, 11 BB's and 27 H's). Starters don't get a component ERA adjustment.

So relievers earn more win shares per inning because the innings they pitch are more important than a starter's, not because their roles are easier. These are two different points altogether.

I'm torn about the best way to handle this, because value is value, regardless of the underlying skill. On the other hand, I think Guy's point is valid, because a reliever's value is based largely on the way he is used by the manager -- which makes reliever evaluation fundamentally different than any other evaluation.

I tend to think that replacement level is the way to go, though Tango and Colin raise good points. To me, replacement level is an economic concept and not a skill concept -- and the market is speaking pretty clearly about the relative value of starters vs. relievers (see Colon/Foulke).

Posted 3:16 p.m., December 29, 2003 (#31) - ColinM
  But Guy, whether it seems to penalize Schilling or not, isn't that what really happens? If you have to replace Schilling, the replacement starter won't be as durable, so his innings will be replaced by a combination of starter and relief innings. If you want to give extra value to Schilling for saving the bullpen for another day then that is a separate thing altogether. I don't think you can just compare all of his innings to a starter benchmark and hope it evens out in the end.

Posted 3:22 p.m., December 29, 2003 (#32) - tangotiger
  What if we find that a player, when placed at catcher, loses 10% of his offensive output? What do you do with this information?

****

Again, we go back to our question. I'm going to make this "Tangotiger Question #1", as I ask this question all the time, and I think it's the most basic question that every fan asks: how would an average player do if placed in this context? This usually leads to "Tangotiger Question #2": how much worse would the team be if a bubble player played instead of the average player?

TQ1: So, I think we can all agree that it is much more likely that an average pitcher would perform better as a reliever than as a starter. So, how would an average pitcher do in Prior's place? How would an average pitcher do in Gagne's place? In Dotel's place?

TQ2: the bubble player will suck more as a starter than as a reliever, but maybe not to the same degree. Maybe he just sucks. A Billy Wagner, who lives and dies on 100% exertion, might not be as effective if he had to pitch as a starter at 95% exertion. But a bubble pitcher just might not have that much of a discrepancy between starter and reliever... sort of a poor man's Greg Maddux or Jamie Moyer... the type of guy who might be equally effective as a starter or reliever.

So, it might be possible that if your baseline is the bubble pitcher, that a reliever and starter should have the same comparison line. It's just that it's easier to leverage Wagner's skills as a reliever than it would be to leverage Maddux's skills as a starter. (Just like it's better to leverage Cameron's skills as a CF than to leverage Rolen's skills as a 3B, even though if they played at some same neutral position, like 1B, they might be equally effective.)

Posted 3:23 p.m., December 29, 2003 (#33) - studes (homepage)
  Perhaps the best way to establish replacement level for starters is per game started, not inning pitched.

Posted 3:42 p.m., December 29, 2003 (#34) - studes (homepage)
  Sorry. I'm a bit out of order. #33 was in response to Colin and Guy.

Tango (#32), are you saying that Guy may have a point for an average pitcher, but that a replacement level per inning might be the same for starters and relievers? If so, how can a measurement system resolve that?

Posted 3:44 p.m., December 29, 2003 (#35) - Guy
  While we like to have an image of "the" bubble player, it is really the average of many different bubble players. One replacement shortstop may play above-league-average defense but bat .200, while another plays below-average defense but hits .260. Presumably, the NET impact on the team's run advantage is the same, making both replacement-level players.

My guess is that Tango's image of a RL pitcher -- the poor man's Jamie Moyer -- is the OLD version of the RL pitcher. The young RL pitcher probably looks quite different: he can throw hard for a couple of innings, doesn't have good control, is generally inconsistent -- and probably performs much better as a reliever. And my guess is that this type of RL pitcher is more common, and would come closer to matching the average RL pitcher. Of course, that's just a guess.

Posted 3:52 p.m., December 29, 2003 (#36) - ColinM
  I think Tango's line of thinking is definitely worth checking out before jumping to two different levels of replacement.

Not sure if I agree with the per game thought Studes. It might be best to start at a seasonal level. In theory, a team's starter/relief innings split will be set at whatever split will maximize the overall effectiveness of its staff, given the players available and the leverage of the situations encountered.

So for any given pitcher, you'd have to figure out how he would be replaced if he couldn't pitch. How many of his innings would be replaced by starters? How many by relievers? How would the bullpen use change?

It's a thorny issue.

Posted 3:56 p.m., December 29, 2003 (#37) - Guy
  Colin: If our replacement scenario is 5 IP for a starter and 3 IP for reliever(s), shouldn't we also ask the question: wouldn't replacement-level relievers pitch even worse if forced (collectively) to pitch 3 innings every day? Perhaps more relevant, replacing Curt Schilling with a starter who went only 5 IP per start would increase the workload of his team's bullpen by about 13% (60-70 IP). This would almost certainly have a measurable negative impact on the bullpen's performance. So no, I don't think you should leave aside the obvious value to the team of Schilling going 8.
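
The arithmetic behind that estimate, as a rough sketch (the start count and bullpen total are assumed round numbers, not figures from the thread):

    # Rough sketch: extra bullpen work created by replacing an 8-IP starter
    # with a 5-IP replacement starter.  All inputs are assumed round numbers.
    ip_per_start_schilling = 8.0
    ip_per_start_replacement = 5.0
    starts = 22                     # assumed number of starts
    team_bullpen_ip = 500.0         # assumed typical team bullpen workload

    extra_bullpen_ip = (ip_per_start_schilling - ip_per_start_replacement) * starts
    pct_increase = 100 * extra_bullpen_ip / team_bullpen_ip
    print(extra_bullpen_ip, round(pct_increase, 1))   # ~66 IP, roughly 13% more bullpen work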

Posted 4:00 p.m., December 29, 2003 (#38) - tangotiger
  If someone is going to try to do some research on this issue, it is *imperative* that your classification of a starter, reliever, #5 starter, #5/6 reliever is done prior to the season in question (as is any replacement-level issue).

As well, when we talk about "true talent" we are talking about the *expected* performance in a neutral setting. In the case here, a neutral setting might be two-thirds of PAs as a starter, and one-third PAs as a reliever. And *expected* performance would be actual prior performance, but regressed a certain amount.

Posted 4:05 p.m., December 29, 2003 (#39) - Guy
  Tango (and others): can you post citations or links to what you consider the best current research/thinking on replacement level performance?

Posted 4:13 p.m., December 29, 2003 (#40) - tangotiger (homepage)
  Go to the above link, and read that article. It's the best one-stop-shop place for replacement ideas.

Posted 4:48 p.m., December 29, 2003 (#41) - ColinM
  Oh I agree Guy, we can't just leave aside the question of the value of extra innings by a starter. But it's not answered by comparing to only replacement starter innings.

But it makes sense that there would be a negative impact on the bullpen in the Schilling scenario. Those extra 60-70 innings are going to be thrown by a replacement reliever in theory. So for sure the bullpen suffers.

But look at it this way:
What if, instead of Schilling, there were two pitchers, a starter and a reliever, who were exactly as productive as Curt? And what if they worked as a tandem, the starter pitching the first 5 innings and the reliever the last 3, so that at the end of the year they had exactly the same combined numbers that Schilling had?

How would you evaluate them over replacement? Shouldn't their combined value be the exact same as Schilling's? (I know that two roster spots aren't as good as one, but that's a different issue.)

Posted 9:00 p.m., December 29, 2003 (#42) - ColinM
  BTW,
I think I might be coming off as a bit too critical here, and wanted to mention that I do think this was a good article by Guy -- which you can tell by the amount of discussion it generated!

Posted 9:21 a.m., December 30, 2003 (#43) - Guy
  Colin: wasn't offended at all by the very reasonable criticism.

As for your example, let me turn it on you: suppose you had four pitchers (call them "Ligtenberg A", B, C, and D) who cumulatively put together Schilling's exact same stats. Would you value them the same as Schilling? Obviously not -- at that point the cost of consuming 4 roster spots is too huge to ignore. But the cost of 1 roster spot is also real. I guess I just don't agree that the roster spot is a different issue -- it has to be accounted for in valuation.

In any case, whether we define the starter benchmark as 8 IP of a replacement starter or the 5/3 combo, it will probably have a fairly trivial impact on the final definition of our benchmark.

Posted 7:30 p.m., December 30, 2003 (#44) - FJM
  In 2003 the average Team ERA for Starters Only was 4.55. For Relievers, it was 4.11. So the difference was 0.44. Fairly large, yes, although a long way from 0.6. The difference is just under 10%.

But let's dig a little deeper. Let's split the data into 3 groups, depending on whether each team's starters had a low, medium or high ERA. My cutoffs are under 4.00 for the good group and 4.90+ for the bad group. The cutoffs were chosen so that the low and high groups would each have 8 teams, leaving 14 teams in the middle. Now, what are the differences for each group?

The average Starters' ERA for the teams with good starting pitchers was 3.78. The Relievers on those 8 teams averaged 3.68. So here the difference is only 0.10 run, or 2.6%. (In fact, it turns out the entire difference for this group can be explained by just one team, the Dodgers, whose Starters came in at 3.49 while the Relievers were a phenomenal 2.46.)

Moving on to the medium group, the average Starters' ERA is 4.45. Their Relievers come in at 4.30. So the difference here is only 0.15 run, or 3.4%. (For this group, over 90% of the difference can be traced to 2 teams with exceptional bullpens, Houston and Minnesota.)

That leaves us with 8 teams to account for the great majority of the Starter-Reliever differential. And their Starters were indeed awful, posting an average ERA of 5.51, more than 1 run worse than the middle group. In contrast, their Relievers posted a decent ERA of 4.23, slightly better than the middle group. That's a differential of 1.28 runs, 23.2%. Moreover, unlike the other 2 groups, where one or two teams accounted for most of the difference, every one of these 8 teams had a significant Starter/Reliever disparity. All but one of them had a differential of at least one run. For three teams (Anaheim, Milwaukee and Cincinnati) the difference exceeded 1.5 runs.

In summary, for those 22 teams that have decent or better starting pitchers, there is very little difference between their starters and relievers in terms of ERA. For teams with lousy starters, their relievers are indeed much better. Isn't that what you would expect?
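
A sketch of the grouping procedure used above, with placeholder numbers standing in for the actual 2003 team data:

    # Sketch of the team-level comparison: bucket teams by their starters' ERA
    # (cutoffs 4.00 and 4.90, as above), then compare starter and reliever ERAs
    # within each bucket.  The three rows below are placeholders, not real teams.
    teams = [
        ("TeamA", 3.75, 3.70),   # (name, starters' ERA, relievers' ERA)
        ("TeamB", 4.40, 4.25),
        ("TeamC", 5.40, 4.20),
        # ...all 30 teams would be listed here
    ]

    def bucket(starter_era):
        if starter_era < 4.00:
            return "good starters"
        if starter_era >= 4.90:
            return "bad starters"
        return "middle"

    groups = {}
    for name, s_era, r_era in teams:
        groups.setdefault(bucket(s_era), []).append((s_era, r_era))

    for label, rows in groups.items():
        avg_s = sum(s for s, _ in rows) / len(rows)
        avg_r = sum(r for _, r in rows) / len(rows)
        print(label, round(avg_s - avg_r, 2))   # starter-reliever gap within the group

(A simple average of team ERAs is used here; weighting by innings would be a bit more careful.)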

Posted 10:13 p.m., December 30, 2003 (#45) - Tangotiger
  Your first statement:

In 2003 the average Team ERA for Starters Only was 4.55. For Relievers, it was 4.11. So the difference was 0.44. Fairly large, yes, although a long way from 0.6.

is good evidence of what is being said. If my pitching spectrum (post #29) holds, we have to accept that the average starter is a better pitcher than the average reliever. So, at the very least, using your above data, you have to bump up the relievers' ERA at least .44 just to make them equals. And a bit more to make sure that the average reliever comes out worse than the average starter.

Posted 1:10 p.m., December 31, 2003 (#46) - FJM
  To some extent, we're arguing about semantics here. What do you mean by "average"? When you say "the average reliever is worse than the average starter", I interpret that as follows. If you choose the same number of starters and relievers at random, most of the starters will be better than most of the relievers. I'm sure you are right about that, but that has very little to do with my statistics. You are talking about individuals; I'm talking about teams. The last 2 or 3 guys in most bullpens are marginal, near replacement level pitchers. But they get very little opportunity to pitch, and never in high leverage situations. So by looking at teams rather than individuals the importance of these players is greatly reduced.

As I said above, for 22 of the 30 teams the difference between their starters and relievers is only about 3%. To me, that's the norm. And that difference is probably largely attributable to the defect in the way ERA is calculated, charging all runners to the pitcher who allowed them to reach base.

8 teams had very poor starting pitching, as defined by an ERA of 4.90 or more. Together, those 8 teams accounted for nearly 80% of the overall Starter/Reliever differential, 0.34 out of 0.44. 3 other teams (Dodgers, Astros and Twins) account for the rest of it. So what is your norm, the 11 teams who average a 1.2 run differential, or the 19 teams where the average differential is 0?

Posted 2:39 p.m., December 31, 2003 (#47) - Michael Humphreys
  Great article and thread. Continued research into both questions of pitcher evaluation--"ability" and "value"--is worth pursuing.

At first, it will probably be easier to refine "value" measurements.

David and Guy are on the right track, though it might be worth considering creating *three* separate replacement levels: starters, closers and middle-relievers. (Or full-time "Average" Leverage, one/two-inning High Leverage and two/three-inning Low Leverage pitchers.)

Replacement value questions are (or should be) "practical" questions--and as a practical matter, no team will allow a replacement-level middle relief pitcher (the bottom-of-the-barrel mop-up guy) to assume a "closer" role. Or at least not for a full season.

Determining a third replacement level (for middle-relievers) might also help in the evaluation of the Curt Schillings of the world. By going into the 8th inning, he creates value above the *middle-reliever* level.

Of course, once we start down this process, it leads to Tango's Win Advancement methodology.

Or does it? Tango, does Win Advancement measure pitcher win advancements against league-average pitching *across ALL innings*, or against league-average pitching *during the inning (or inning/base/out) situation* in question? Perhaps sample size issues make the latter measure impractical.

Maybe another way to pose the question is whether separate *Run* Expectations are calculated for each INNING/SCORE/base/out situation, and *then* translated into Win Expectations, or whether Run Expectations are calculated using *all* innings, and then translated into Win Expectations based on the inning/score.

The former approach would indirectly cause a starter in the eighth inning to be compared against middle relievers. If we could do that, we'd have a perfectly "granulated" pitcher evaluation system. That is, Curt Schilling's 7th inning Win Advancements would take into account that he is sparing his team the cost of going to middle-relievers, who are generally the worst pitchers on a team. (Of course, Win Advancements measure value over average, but I suppose replacement levels could be determined as well.)

Though probably an even more difficult problem, evaluating pitcher "ability" independent of usage context would be very useful, because it's almost certainly true that teams mismeasure "real" ability and badly misallocate pitching resources. (This in turn results in distorted *value* measurements, because the replacement-level talent pools and their impact on wins are out of whack, so value measurements vis-a-vis such data probably differ from "real" ability much more than happens with batters and fielders.)

So really both "ability" and "value" research support each other and are both worth pursuing.

Posted 3:47 p.m., December 31, 2003 (#48) - Tangotiger
  I don't see how looking at teams helps in looking at players in this instance, but let's not beat this dying horse.

***

For WPA, it doesn't matter much what you use, because my consideration is that both teams are equally talented at every point in the game. So, if you've got a 5.5 ERA pitcher, I've got one too. And my offense is as bad as your pitching, so that we're always a .500 team in a 5 RPG environment. This is the easy way to do WPA, and one which I do while I continue having a full-time job.

Posted 3:51 p.m., December 31, 2003 (#49) - David Smyth
  In the 2002 Prospectus annual, Woolner gave formulas for the repl level of starters and relievers:

St: 1.37*LgRA/G -.066
Rel: 1.70*LgRA/G -2.66

Applying this to MLB 2003, and converting into ERAs, the repl ERA for starters was 5.90, while for relievers it was 5.31. The difference is .61. Tango hit the bullseye.

Of course, the repl ERA for closers is a heck of a lot different than for the mop-up man. But those formulas are a good starting point. I think the MLB ERA in 2003 was 4.35. So the repl % for starters is about 74% of avg, and for relievers it's about 82% of avg.

Posted 1:16 a.m., January 2, 2004 (#50) - Rob Rolek(e-mail)
  Free Outs, so don't use ERA:

Ken Arneson mentioned this above but I thought I'd expand on it. Any reliever who enters the game in the middle of an inning is at a big advantage. If there are any outs he gets Free Outs. It's a lot easier to give up 0 ER if you only need 1 or 2 outs instead of 3. Even if there's nobody out, if there are men on base, those baserunners can't hurt you but can help you w/forceouts, DP's, baserunning outs, while if they score they don't touch your ERA. So, we shouldn't use plain old ERA when evaluating relievers. I don't think the effect of Free Outs is negligible, but I have no data to back it up. I wonder:

1. How much of the ERA difference do Free Outs account for?
2. Someone said they assume reliever Component ERA is better than starter Component ERA. Is it? Does anyone have the data?

Thanks for the great discussion!

Posted 7:43 a.m., January 2, 2004 (#51) - Guy
  David: I just checked the Woolner formula using the 2003 MLB ERA of 4.40. They would give you replacement ERAs of St 5.96 and Rel 4.82, a difference of over 1 R/G. Is there an error in the formulas you posted, or is the difference actually greater than .6?

Posted 9:05 a.m., January 2, 2004 (#52) - David Smyth
  Rob, I think the 'free outs' factor is worth about .2 runs

Guy, the formulas appear to be correct to me. I did make a calculation error, though. Using R/G, as the formula does, I now get 5.36 ERA/4.95 ERA (St/Rel). Using instead R/9IP, I get 5.41/5.02 (since games did not avg exactly 9 IP). Anyway, the difference is now about .4 runs, which is about double the free outs factor.

Posted 9:15 a.m., January 2, 2004 (#53) - David Smyth
  BTW, Guy, I get a MLB ERA of 4.35 instead of 4.40. It looks like you simply averaged the 2 lgs ERAs together, while I combined the totals and then computed the overall ERA. Since the NL plays more games, the ERA gets weighted down a bit.

Posted 10:04 a.m., January 2, 2004 (#54) - Guy
  Sorry to belabor this, but I don't follow. At BP Woolner has RA/9IP for ML at 4.77 for 2003. Using the formulas above:
St: 1.37*4.77 -.066 = 6.47
Rel: 1.70*4.77 -2.66 = 5.45
Of course ERAs will be a bit lower, but the issue is the gap. What am I missing?

Posted 10:24 a.m., January 2, 2004 (#55) - tangotiger
  I think David meant to say -0.66 and not -.066
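
With the corrected constant, the formulas come out roughly as follows (the 4.77 RA/9 is the figure cited in #54; the ~92% earned-run share used to convert RA to ERA is an assumption):

    # Woolner's replacement formulas with the corrected starter constant (-0.66).
    lg_ra_per_9 = 4.77      # MLB 2003 RA/9, per #54
    er_share = 0.92         # assumed fraction of runs that are earned

    repl_ra_start = 1.37 * lg_ra_per_9 - 0.66    # ~5.87 RA/9
    repl_ra_relief = 1.70 * lg_ra_per_9 - 2.66   # ~5.45 RA/9

    print(round(repl_ra_start * er_share, 2))    # ~5.40 ERA
    print(round(repl_ra_relief * er_share, 2))   # ~5.01 ERA
    print(round((repl_ra_start - repl_ra_relief) * er_share, 2))   # gap ~0.39

That lands close to the 5.41/5.02 figures in #52, with a gap of roughly .4 runs.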

Posted 12:05 p.m., January 2, 2004 (#56) - Guy
  Thinking about this some more, it's not clear that this comparison captures the inherent reliever advantage. I assume Woolner is measuring the bottom 10% of starters and relievers, or something similar. But the worst #5 starters are presumably better pitchers than the worst middle-relievers. Tango suggested this spectrum above:
1S, 2S, 1R, 3S, 4S, 2R, 5S, 3R, 4R, 5R, 6R
Assuming it's something like that, comparing RL starter to RL reliever will considerably understate the reliever advantage. Put another way, a replacement-level starter would perform better in relief (on average) than a replacement-level reliever.

Posted 4:22 p.m., January 2, 2004 (#57) - David Smyth
  Thanks for the correction, Tango. I looked twice and I still missed it.

As far as what Guy is talking about in his last post, it seems similar to what (I recall) Tango said earlier--that the .4 run gap should be "bumped up" because the avg starter has more ability than the avg reliever. You can do that, but there is another viewpoint which seems to be equally compelling, IMO.

And that is, if an adjustment has no impact in the real world (but is true in theory), then it may be counterproductive to apply it. If avg starters are better than avg relievers, and IF this fact makes them more scarce and therefore of higher value, then we should expect to see a wider difference at the repl level than at the avg level. If the gap is the same (.4 runs), then I think that means, according to the laws of supply and demand, that avg starters are no more valuable than avg relievers (everything else being equal, of course). So I would be hesitant to bump up anything.

On a related note--Woolner did find that the repl/avg gap was different for 2 positions. For 2b,SS,3b,LF,CF,RF, the repl is 80% of the avg offensive production at the position. For 1b it's 75%, and for C it's 85%. It seems to me that 1b has such good hitters that it's hard to find suitable repl hitters. And that it's relatively easy to find repl hitters at C (relative to the avg hitting C) because C are pretty big and strong (unlike SS, who are also poor hitters). Anyway, it seems to me that player valuation systems should take Woolner's findings into account.
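
A small worked example of what those percentages imply; only the 75/80/85% figures come from Woolner, and the run totals are hypothetical:

    # What positional replacement percentages mean in runs (hypothetical totals).
    repl_pct = {"1B": 0.75, "C": 0.85, "SS": 0.80}   # repl offense as % of positional average
    avg_runs = {"1B": 100, "C": 70, "SS": 75}        # assumed average runs created, full season

    for pos in repl_pct:
        gap = avg_runs[pos] * (1 - repl_pct[pos])
        print(pos, round(gap, 1))   # avg-to-replacement gap in runs at each position
    # 1B ~25 runs, C ~10.5, SS ~15: the gap is biggest where repl hitters are scarcest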

Posted 5:52 p.m., January 2, 2004 (#58) - Guy
  David: I think we're back to Tango's point that there are just RL pitchers, not two pools of starters and relievers. A #6 reliever is a RL pitcher -- what you can generally find available. But the worst #5 starters are better pitchers, and are not freely available in the same way. They have been selected for that role precisely because they are better than most of the team's relievers. Suppose you calculated a RL closer, by looking at the worst pitchers used in at least 10 save opportunities. This would be something comparable to R2 or R3 -- perhaps even a better pitcher than a #5 starter -- but clearly not "replacement level" in the usual sense. You're doing the same thing here.

To know the reliever advantage, we have to know -- or estimate -- how the RL reliever would perform in a starting role. And we can be confident that would be worse than Woolner's "RL starter," or else he'd be starting!

Posted 7:30 p.m., January 2, 2004 (#59) - David Smyth
  I am responding quickly here, without having really thought through your last post, Guy. But I am saying that yes, in terms of ability, you are probably right. But my point was that, if the supply of players is such that differences in abstract ability do not result in any scarcity of players to fill positions, then it might be "much ado about nothing," to quote Shakespeare. If you give an avg starter more credit because he presumably has more ability, but he is just as easy to replace, I don't think that is a good tradeoff. Obviously I am just talking about a generic starter vs a generic reliever; when you start talking about a #3 starter vs a set-up reliever (or whatever), then things change. I would certainly not opt, say, to evaluate Gagne's 2003 season against a marginal closer (who is still probably an avg ERA pitcher). But I do think there are enough systematic differences between starters and relievers that I would make at least that separation in an analysis.

Posted 11:43 p.m., January 2, 2004 (#60) - Guy
  I'm at a disadvantage, not knowing Woolner's methodology. But my point is precisely that the worst 5th starters -- if that's what you mean by a "replacement level starter" -- are indeed more scarce than the worst relievers. They are in fact the 6th or 7th best pitcher on most staffs. I think you would find they are paid considerably more than the worst relievers, and probably out-perform them when they are called upon to pitch in relief (or when the relievers are forced to make an occasional start). IOW, the worst starters are NOT replacement level pitchers; replacement level pitchers aren't permitted to start.

And given this disparity in talent between the worst starters and worst relievers, I find a .4 ERA gap entirely consistent with the idea that relievers enjoy an advantage of .6 or more (given two pitchers of equal ability).
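
To make the arithmetic explicit (the 0.2-run talent gap is purely illustrative):

    # Illustration of how a .4 observed gap at "replacement" can coexist with a
    # .6+ relief advantage for the same pitcher.  The talent gap is an assumption.
    observed_repl_gap = 0.4    # RL starter ERA minus RL reliever ERA (Woolner-style)
    talent_gap = 0.2           # assumed: the worst starters are ~0.2 runs better
                               # pitchers than the worst relievers
    same_pitcher_advantage = observed_repl_gap + talent_gap
    print(same_pitcher_advantage)   # ~0.6 runs better in relief for a given pitcher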

Posted 6:58 p.m., January 3, 2004 (#61) - David Smyth
  All I am saying is that, the "average" comparison level does not really tell us much about supply and demand at a given position. And the supply and demand factor tells us, essentially, whether we should be focusing more on ability or value. The supply and demand factor for starters and relievers suggests that it is about the same (according to Woolner's formulas), so that it doesn't really matter whether the avg starter is better than the avg reliever. If you are contemplating switching a specific player from starter to relief or vice-versa, you also have to take into account the expected replacement in these scenarios. And if Woolner is correct, the supply of players is such that you can "ignore" the underlying differences between avg starters and relievers.

Posted 9:02 a.m., January 4, 2004 (#62) - Guy
  I must not be making my point clearly. I don't believe that the scarcity of 5.36 ERA starters and 4.95 ERA relievers is the same. Do you? There are plenty of 5.36 starters out there making good money, well above ML minimum. Cory Lidle (5.75 ERA) just signed a $2.75M contract -- think any 4.95 relievers have done the same? If a pitcher only manages a 4.95 ERA in relief, pitching 1-2 innings at a time, he won't keep a job very long (i.e. he is a true replacement level pitcher). These are NOT equivalent pitchers, in terms of scarcity, ability, or anything else. Does Woolner present evidence to the contrary?

Posted 9:25 a.m., January 4, 2004 (#63) - David Smyth
  The scarcity argument does not apply to repl starters and repl relievers, both of whom are presumed to be somewhat freely available, almost by definition. It applies to avg starters and avg relievers. If your team's avg starter has an ERA which is .4 runs higher than your team's avg reliever, and they both go out with injury, these results suggest that the difference in their replacements will still be about .4 runs. That starters are more valuable than relievers because they pitch more innings has nothing to do with this. Your C Lidle point is essentially a suggestion that Woolner's formula, developed over many years of data, does not apply to the present day. I have no idea if you are correct about that, but I suspect not. And I suspect that Lidle's 'ability' is much better than one bad performance sample of a 5.75 ERA. Also, the repl level ERA should be worse in the AL.

Posted 1:39 p.m., January 4, 2004 (#64) - tangotiger
  both of whom are presumed to be somewhat freely available, almost by definition

I don't get this. The replacement level starter is your #3 or #4 reliever. The replacement level reliever is your best pitcher in the minors, or some crud that was just released.

The value of the avg starter v avg reliever has nothing to do with the IP numbers. The avg starter is simply a better pitcher, on a per PA basis, than the avg reliever.

In the minors, the #5 starter is better than probably even the #1 reliever.

I think a lot of the discussion so far assumes, from both sides, things that the other side doesn't.

Posted 5:28 p.m., January 4, 2004 (#65) - AED
  David, the scarcity argument does apply. A replacement level reliever can be picked up off the scrap heap. "Replacement level starter" is a misnomer because he is not freely replaceable (because he has value as an above replacement level reliever).

Posted 7:51 p.m., January 4, 2004 (#66) - David Smyth
  Fine, I'm open-minded. But show me the evidence that a "replacement level starter" is a misnomer because he is not freely replaceable. The Woolner evidence suggests otherwise. I am aware of all the switching that goes on when teams convert a "failed" starter to reliever, etc. And I have no problem with the idea that an avg starter has more ability than an avg reliever. But if the difference between an avg starter and an avg reliever is .4 runs, and the difference (in practice) between a repl starter and a repl reliever is also .4 runs, then I conclude that the entire supply and demand mechanism for pitchers, including the shifting of pitchers from starter to reliever and vice-versa, is doing an effective job of making an avg starter and an avg reliever (ignoring the IP differences and the leverage differences) of equal net value over repl, in the real world.

Posted 1:17 p.m., January 5, 2004 (#67) - AED
  Nobody around here seems to know what Woolner did, so I can't comment on his evidence. I think most of us assume he looked at average ERAs for "#6 starters" and "#8 relievers" (or something like that), defined those as "replacement level", and evaluated those averages as a function of league ERA and came up with his equations. If this is indeed the case, it fails to address the problem that a team's #6 starter is probably its #3 reliever, in which case you are not talking about a replacement-level pitcher but rather someone better than replacement level.

Posted 1:32 p.m., January 5, 2004 (#68) - Guy
  Yes, that's what I've been trying to say. You can't select a subset of players based on ability, then measure the worst of that group and proclaim it "replacement level." Then a "replacement level closer" might have an ERA of 4.00. A "replacement level starting 1B" (500 PA) might have an OPS of .800, and a "replacement level All-Star 1B" an OPS of .860. And in none of these cases would "replacement level" be meaningful. Taking "starting pitchers" as a group is just a less extreme version of the same thing -- they are selected to start because they are better than replacement level.

Posted 3:36 p.m., January 5, 2004 (#69) - David Smyth
  You know, the problem here is that we are simply trying to answer questions which are somewhat different. Therefore, I think we are all basically correct (each relative to the question he is asking), and I guess it should just be left at that...