Tango on Baseball Archives

© Tangotiger

Archive List

Silver: The Science of Forecasting (March 12, 2004)

Nate does a pretty good job of describing forecasting. Good graphs always add a lot to an article.

See my comments in post #1.

--posted by TangoTiger at 10:16 AM EDT


Posted 10:17 a.m., March 12, 2004 (#1) - tangotiger
  Let me touch on one point that I would want substantiated. This would be a good exercise for the sabermetricians in the group.

Here's what Nate says at the end:
PECOTA accounts for these sorts of factors by creating not a single forecast point, as other systems do, but rather a range of possible outcomes that the player could expect to achieve at different levels of probability.

Let me just say that all forecasting systems implicitly do this. It would be impossible to project a player as getting exactly a .360 OBA for next year. A forecaster is really giving an implicit range where this .360 is the mean (or possibly median) of that distribution.

Instead of telling you that it's going to rain, we tell you that there's an 80% chance of rain, because 80% of the time that these atmospheric conditions have emerged on Tuesday, it has rained on Wednesday.

Nate should be applauded for actually being explicit with his range. Saying that Hubie's OBA will be .360 with 1 SD = .020 is clearer than saying that his OBA will be .360. We all know that the forecaster in the second case isn't trying to pinpoint it exactly. But, in the first case, at least he's trying to give you enough information for you to figure out his probability range.

Surely, this approach is more complicated than the standard method of applying an age adjustment based on the 'average' course of development of all players throughout history. However, it is also leaps and bounds more representative of reality, and more accurate to boot.

What Nate is showing is not "more representative of reality", since no one should think that a forecaster's mean has an SD = 0. What he is showing is something explicit. However, is what Nate is showing something that is peculiar to how his engine works? Or is it simply something implicit in all systems that he's making explicit?

Now, here's where I'd like to challenge Nate's system. I would say that all players with a similar number of PAs over the last three years (say 1800-1900 PAs), with around the same overall talent level (say somewhat above average), but with widely varying sets of skills (say Jeter and Magglio), should have a similar range of forecasts. That is, the range that is being provided is almost entirely based on the number of PAs.

So, for the sabermetricians in the group: check it out. Can the forecast for the range for OBA, HR/PA, BB/PA be almost completely derived from the number of PAs? Or, is the profile of the player important to establish this range?

And look at pitching too. Is the range of ERA for pitchers with similar past IP, but different K/BB ratio and K/IP rates the same or different?
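Tango's challenge can be sketched quickly. Under a plain binomial model (my illustration; nothing here is from PECOTA), the spread of an observed rate around true talent depends only on the number of opportunities, not on the player's skill profile:

```python
import math

def rate_sd(true_rate, pa):
    """Binomial SD of an observed rate (OBA, HR/PA, BB/PA) around
    true talent, given a number of opportunities."""
    return math.sqrt(true_rate * (1 - true_rate) / pa)

# Two hypothetical players with different skill profiles but similar
# overall talent and PAs get nearly identical spreads under this model.
jeter_like = rate_sd(0.360, 1850)
magglio_like = rate_sd(0.355, 1820)
print(round(jeter_like, 4), round(magglio_like, 4))
```

If a forecast's bands were built this way, two players with 1800-1900 PAs and similar overall talent would get nearly identical ranges regardless of skill set, which is exactly the thing to check.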

Posted 11:30 a.m., March 12, 2004 (#2) - Nod Narb
  How useful are similarity scores in predicting future development? Has anyone ever studied this exclusively? It seems to me that with advances in technology, training, nutrition, and [cough] supplements, there isn't much usefulness in comparing a modern day player to a player who played 30, 40, 50 years ago. At least to predict development.

I don't have BP 2004 with me, so I don't remember which 10(?) criteria they use for the basis of their comparisons. I know walk rate, K rate, etc. are included, so maybe I am way off - maybe these types of indicators transcend era. Thoughts?

Posted 11:38 a.m., March 12, 2004 (#3) - tangotiger
  I published aging patterns based on 1979-1999 data, and 1919-1999 data. They were the same.

Whatever advantages the players have today with regards to nutrition/exercise carries itself to all ages.

Posted 12:14 p.m., March 12, 2004 (#4) - cheng
  Nod Narb - I don't have the book in front of me, but I think the physical characteristics of height and weight are the 7th and 10th factors on the similarity scoring for hitters. I believe PECOTA implicitly assumes that the differences in training and nutrition would translate to on-field performance, which account for the other 8 factors. I think height is a little more important for pitchers (fifth?) but weight comes in last again.

Also, all rates are normalized by Nate's use of percentiles - a 1-1 BB/K ratio for a hitter in the 1950 NL would be 55th percentile or so, while the same ratio in 2003 would be much higher.

This is nitpicking, really. Your larger question about the predictive value of similarity scores is the important issue, and it has not been empirically confirmed for PECOTA. I like tango's suggested approach of using the odd years as one data set, coming up with the system, and testing the system on the even years as a sanity check. As far as I know, this has not been done.

Posted 12:47 p.m., March 12, 2004 (#5) - J Cross (homepage)
  I made the calculation in the homepage link to see if there's some trend to who PECOTA favors compared to ZiPS. I'll make a similar comparison for pitchers. I also want to sort pitchers into "high K" and "low K" groups (since K's are the first determinant of comparables for pitchers), compare their PECOTA/ZiPS projections, and then return to them at the end of the year.

Posted 1:09 p.m., March 12, 2004 (#6) - tangotiger
  J, good stuff!

Posted 3:13 p.m., March 12, 2004 (#7) - studes (homepage)
  Tango, that's a great question. My gut tells me that it does matter -- that you do want to forecast similar players based on components. I think this is particularly true of pitchers, and somewhat true of hitters. And I might wonder if there is some cross-impact between a certain skill set in a certain ballpark, when forecasting players.

OTOH, I don't think the components have very much impact, and the added accuracy may not be worth the added work and complexity.

This was a nice article. Nate is the best at using graphs in his work. Well, second best...

Posted 3:22 p.m., March 12, 2004 (#8) - tangotiger
  Actually, I didn't ask if using components is important for the forecasts... they are.

I'm asking if the RANGES of the forecasts for a player are any different among players with the same number of PAs. If the range of ERA forecast for Pettitte is 2.5 to 5.5, and Mussina is 2.0 to 5.0, and Pedro is 1.5 to 4.5... well, you see what I mean? It doesn't matter what kind of pitcher you are; everyone with the same number of past PAs will get the same range.

(I have not looked into it. This is what I'm asking for others to verify.)

Posted 5:41 p.m., March 12, 2004 (#9) - Nod Narb (homepage)
  Ron Shandler also has an article about forecasting (see homepage)

Posted 8:54 p.m., March 12, 2004 (#10) - Ex-Ed
  Tango, we don't know enough about how PECOTA works to answer the question. The number of PAs should affect the forecast standard errors for the usual reasons, but imagine that the models for the four different player types shown by Nate vary in terms of their predictive power and hence by the standard error of the estimate. If so, then the forecast standard errors will vary by PA *and* by player type.

But we really don't know, since PECOTA is a black box.

Posted 3:38 a.m., March 13, 2004 (#11) - MGL
  Surely, this approach is more complicated than the standard method of applying an age adjustment based on the 'average' course of development of all players throughout history. However, it is also leaps and bounds more representative of reality, and more accurate to boot.

This is an unbelievably presumptuous statement, especially the last sentence. How in the world do we (or even they) know that it is "more accurate" or "leaps and bounds more representative of reality?"

There is something insidious about making claims or even writing "academic style" articles about a methodology that is a "black box." I'm even tempted to say something like "I think your forecasting method stinks," and if they want to prove it doesn't (stink) they need to put up (tell us the methodology) or shut up.

You simply can't defend something when you won't tell anyone exactly how it works. And as I said, I sincerely don't think you should be making too many claims about how great something is when no one knows how it works. Not only do we have an absolute right to be skeptical, we have an absolute right to reject those claims summarily.

Given that, it sure sounds like Pecota does one hell of a job in its forecasting methodology, but we have absolutely no idea whether it does what it says it does, or whether any of their impressive claims have any merit.

I am especially skeptical of the whole "similarity score" thing. One, once you start using 5 or 10 or 15 "similar" players in order to forecast a player's future performance, you run into huge sample size problems. Kind of like their 6'4" catcher study (on Mauer). That was indeed a ridiculous assertion they made about Mauer based on a handful of other 6'4" catchers (and not 6'3" ones, or 6'5" ones, if there were any), as Gleeman properly pointed out. The problem with their "conclusion" was threefold: one, the small sample size; two, the elimination of other useful data (e.g., 6'3" catchers); and three, the assumption that height means anything at all in terms of a catcher's career value (maybe it does and maybe it doesn't).

Getting back to their "similarity score" methodology, the second problem is this: do we know that a player's projection is a function of the "type" of player he is, independent of what an average overall player with the same historical stats would be projected at? Before I started sacrificing sample size and using similar players to do my forecasting, I would want to be darn sure that "player types" are significantly related to a player's projection notwithstanding his historical stats and the usual Marcel-like projections. I have never seen anyone show that to be true, which is why I think the statement that "it is also leaps and bounds more representative of reality, and more accurate to boot" is incredibly presumptuous...

Posted 7:37 a.m., March 13, 2004 (#12) - David Smyth
  ---"One, once you start using 5 or 10 or 15 "similar" players in order to forecast a player's future performance, you run into huge sample size problems."

According to Silver, "Pecota uses as many as 100 comparables in generating its forecasts." I wish he had also given the "average" number of comparables used.

---" I would want to be darn sure that "player types" are significantly related to a player's projection notwithstanding his historical stats and the usual Marcel-like projections. I have never seen anyone show that to be true"

For me, that's not the real problem. I'm willing to accept, based on logic alone, that similarity has an independent impact. But what I want to know is, what is the best way to use that insight? Do you make similarity the "focal point" of a method, or do you perhaps just factor it in to a Marcel-type system, with a certain "weight" given?

Posted 8:32 a.m., March 13, 2004 (#13) - mathteamcoach
  J. Cross,

What are the chances that we can get Marcel in on your comparison of ZiPS and PECOTA?

Posted 9:43 a.m., March 13, 2004 (#14) - Ex-Ed
  I've beefed on the similarity issue before, but one of the many things we haven't seen is evidence that the model works better by using similar players. That is a strong assumption, and by definition, non-similar players give you information about each other.

Say you have one independent variable. By selecting similar players, you are selecting on the independent variable, restricting the range of your x's, and lowering your r-squared or your s.e.e., or whatever fit statistic you care about. That's research design 101.

Posted 10:43 a.m., March 13, 2004 (#15) - Wally Moon
#12: Do you make similarity the "focal point" of a method, or do you perhaps just factor it in to a Marcel-type system, with a certain "weight" given?

Best I can figure out, it is how the similarity scores are determined that is the main claim to innovation in PECOTA -- where the "secret formula" is. About all we know about the specific method used is in that BP2003 chapter by Silver, with the comparisons between PECOTA and Bill James's method. It is this part of the method that is the "art" or invention of PECOTA and the part that Silver is least willing to share. If anybody else knew how he selected the comparables -- the particular weights given to the 10 factors or variables -- and how he used the similarity scores to make the projections then they'd be able to replicate his predictions.

The "Science of Forecasting" essay uses 4 archetypes as illustrations. Do we know how many are used in PECOTA? I have a hunch he's got a bunch of them, perhaps a series of gradations on multiple dimensions, rather than any fixed number of categories. The 4 types given in this article are just for illustrative purposes, not to say how PECOTA actually operates.

On the number of comparables, I'd be interested in how reliable and accurate the performance predictions are as a function of the number of comparables. Is PECOTA less accurate when there are fewer comparables or when the similarity scores are lower? It would seem likely, but would be interesting to know.

I asserted before that nobody should expect Silver to divulge his "formula." (And the core of that formula is in the determination of comparables.) The proof of the system is in the pudding, not the recipe. If anybody can come up with better predictions of performance, they should do so.

Silver is right that you don't see confidence levels (whatever process may be generating them -- simply N-size, or something more) reported in most baseball forecasting systems. But you'll notice that when he does his comparisons of PECOTA with the competition, he still has to revert to his weighted means to say anything useful about PECOTA's predictive validity vs. the competition.

All this said, I thought this was a well done article, one of the best in the "BP Basics" series.

Posted 10:46 a.m., March 13, 2004 (#16) - tangotiger
  The way I see Silver doing similarity is: take the 100 most similar players, giving them a certain weight (maybe 10 times more weight for the most similar down to 1 weight for the 100th most similar). Then, take each of those players, and find THEIR most similar players.

That's one degree of separation, and you might end up with say 1000 unique players, each weighted between 1 and 30 (numbers for illustration only).

At this level, sample size might not be an issue. However, before you discard the 10,000 hitters in favor of the 1,000 hitters that might be more representative, you have to establish whether being more representative using this model does contain extra information.

It's an interesting process to go through (I'm not even sure this is what PECOTA does). We should remain skeptical about any grandiose claims that can't be reproduced. Chris Kahrl said so just last week.
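A toy version of the weighting scheme sketched above (made-up players, made-up similarity lists, linear weights from 10 down to 1; no claim that PECOTA works this way):

```python
from collections import defaultdict

# Hypothetical similarity lists, most similar first.
comps = {
    "A": ["B", "C", "D"],
    "B": ["A", "D", "E"],
    "C": ["B", "F", "A"],
    "D": ["C", "E", "B"],
}

def weight(rank, n):
    # Most similar player gets weight 10, least similar 1, linear in between.
    return 10 - 9 * rank / (n - 1)

pool = defaultdict(float)
first_degree = comps["A"]
n = len(first_degree)
for rank, p in enumerate(first_degree):
    w = weight(rank, n)
    pool[p] += w
    # One degree of separation: fold in each comparable's own comparables,
    # scaled down by how similar that comparable was to the target.
    for rank2, q in enumerate(comps.get(p, [])):
        pool[q] += w * weight(rank2, n) / 10

print(sorted(pool.items(), key=lambda kv: -kv[1]))
```

With real data the pool would grow from 100 comparables to on the order of 1,000 unique weighted players, which is where the sample-size question gets interesting.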

Posted 12:48 p.m., March 13, 2004 (#17) - tangotiger
  Here, this is what I'm talking about. Let's look at some pitchers from the Angels:

Pitch 90th 50th 10th
Colon 2.83 3.71 5.34
Ortiz 3.09 4.55 6.92
Washb 3.4 4.16 5.75
Sele. 3.73 5.3 8.08
Lacke 3.08 4.18 5.27
Esoba 2.62 3.97 5.14

This is the ERA forecasted by Nate. I put them roughly in order by experience.

Resetting the above numbers relative to the 50th percentile, we get:

Pitch 90th 50th 10th
Colon 76% 100% 144%
Ortiz 68% 100% 152%
Washb 82% 100% 138%
Sele. 70% 100% 152%
Lacke 74% 100% 126%
Esoba 66% 100% 129%

ALL 73% 100% 140%

There's not really much to glean from here in terms of my expectation that the more PAs a pitcher has had, the narrower his bands should be.

I don't understand why Ortiz' band would be so wide. Why do we know less about Ortiz than Lackey?

There's a lot to talk about here with regards to these bands. Their reliability has also not been established. It's a good "to be" model (to use a corporate America term). But we can't say that we are there yet.
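The normalization in the second table is just each percentile divided by the 50th, with the ALL row a simple average of the ratios; a few lines reproduce it from the first table:

```python
# 90th / 50th / 10th percentile ERA forecasts from the table above.
forecasts = {
    "Colon": (2.83, 3.71, 5.34),
    "Ortiz": (3.09, 4.55, 6.92),
    "Washb": (3.40, 4.16, 5.75),
    "Sele":  (3.73, 5.30, 8.08),
    "Lacke": (3.08, 4.18, 5.27),
    "Esoba": (2.62, 3.97, 5.14),
}

lo, hi = [], []
for name, (p90, p50, p10) in forecasts.items():
    lo.append(p90 / p50)
    hi.append(p10 / p50)
    print(f"{name}: {p90 / p50:.0%} 100% {p10 / p50:.0%}")

print(f"ALL: {sum(lo) / len(lo):.0%} 100% {sum(hi) / len(hi):.0%}")
```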

Posted 4:21 p.m., March 13, 2004 (#18) - MGL
  There is one thing that I would also like to know. Of what practical use is knowing the "reliability bands"? I don't mean to imply that there are none, only that nothing obvious jumps out at me.

Take the most extreme cases. Player A (a pitcher) has no history so we assign the league mean (or league rookie mean or whatever) to his 50th percentile. Say that's a 4.50 ERA. His "band" would simply be the distribution of true talent among players with no history (again, say rookies), whatever that might be. Let's say that Player B has the same 50th percentile, but that he has lots of history such that his band is narrower. Which player would any given team want?

What about the same 2 types of players, but player A has a little history and that history is fantastic, such that his 50th percentile is 3.50, and player B is a veteran with a 3.50 mean projection. Again, of what practical importance are the "bands" to any given team?

Posted 4:23 p.m., March 13, 2004 (#19) - MGL
  Tango, I would guess that the "similar players" have a lot to do with the "bands" such that it would not be obvious what those bands will be just by looking at historical playing time and projected playing time...

Posted 4:35 p.m., March 13, 2004 (#20) - J Cross (homepage)
  What are the chances that we can get Marcel in on your comparsion of ZiPS and PECOTA?

very good, that should be no problem, mathteamcoach.

Posted 10:46 p.m., March 13, 2004 (#21) - Wally Moon
  "Take the most extreme cases. Player A (a pitcher) has no history so we assign the league mean"

But of course that's not what PECOTA would do. It would use translated minor league (or international) experience to choose comparables instead of assigning the league rookie mean and variance.

Posted 11:21 p.m., March 13, 2004 (#22) - tangotiger
  The most extreme case would be a player coming straight from college (assuming that comparables are set only starting from the minor leagues, and draft position is not used).

Posted 2:35 p.m., March 14, 2004 (#23) - tangotiger
  I just picked out 2 relievers:

Donne: 1.89, 3.18, 4.89 (59%, 154%)
Perci: 1.95, 3.47, 4.88 (56%, 141%)

That spread, relative to the starters, just looks wrong. I will guess that the spread is not based on probability distributions driven by the # of PAs (as it mostly should be), but mostly on the comps. And I would guess that you are going to get weird results like this.

The fewer the # of PAs, the larger the expected spread of performance. We're getting that, but just barely. When I get a chance tomorrow, I'll show what we think the spread should be.

Posted 8:49 p.m., March 14, 2004 (#24) - Ex-Ed
  Regarding similiarity scores, we know that PECOTA uses ANOVA to identify predictive components (BP 2003).

How it uses ANOVA, or why it uses ANOVA rather than factor analysis, cluster analysis, MDS, or other more traditional techniques, we know not.

Whether player i's comparables are weighted by similarity to i in making the forecasts, we don't know. But it has always seemed to me that the knife edge assumption of comparable/not comparable is a very strong one.

Posted 11:58 a.m., March 15, 2004 (#25) - tangotiger
  Suppose you have a pitcher whose true talent OBA you "know" is .340. What do you expect his ERA to be? A quick short-hand would be to do: OBA^2 * 37. So, that's 4.28. It's not too important whether it's 37 or 38, or 1.8 instead of 2, etc. This is just a nice quick and dirty way.

Now, what if this pitcher is going to face 1000 batters? What do we expect his ERA to be? In this case, we are 95% sure that it'll be between 3.56 and 5.06, or 83% to 118% of his true talent ERA.

What if instead, he will face 300 batters? In this case, the spread would be 70% to 135%. So, we have a substantial difference in our expected ERA based strictly on how much playing time the player will get.

Remember, we started off "knowing" his true ERA. For a starting pitcher, we are more certain about this than for a reliever (because a starting pitcher will have faced 2500 batters in the last 3 years compared to the 800 that a reliever would face). If you add that level of uncertainty to the true ERA, that widens the gap even more.

Therefore, if you want to verify the reasonableness of the PECOTA probability ranges, just compare the forecasted range for the starters and relievers. You should find a substantial difference between the two.

(Note also that the ERA itself is also subject to uncertainty, because of the "stringing" of hits and outs, along with the fielders impact on BIP).
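The arithmetic in this post can be sketched in a few lines, using the OBA^2 * 37 shorthand and a 2 SD binomial band (a quick-and-dirty model, not anything from PECOTA):

```python
import math

def era_range(true_oba, bf, z=2.0):
    """Rough 95% band for observed ERA, given true-talent OBA and batters
    faced, using the quick ERA ~ OBA^2 * 37 shorthand from the post."""
    true_era = true_oba ** 2 * 37
    sd = math.sqrt(true_oba * (1 - true_oba) / bf)  # binomial SD of OBA
    lo = (true_oba - z * sd) ** 2 * 37
    hi = (true_oba + z * sd) ** 2 * 37
    return true_era, lo, hi

for bf in (1000, 300):
    era, lo, hi = era_range(0.340, bf)
    print(f"{bf} BF: {lo:.2f} to {hi:.2f} ({lo / era:.0%} to {hi / era:.0%} of true ERA)")
```

This reproduces the 83%-118% and 70%-135% figures above, so on playing time alone a starter's band should be visibly narrower than a reliever's.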

Posted 12:53 p.m., March 15, 2004 (#26) - Wally Moon
  I think for this kind of analysis you somehow have to address the selection bias and simultaneity problem. If a pitcher starts out substantially underperforming his "true" ERA, or perhaps his "expected" ERA based on Marcel, then he's likely to get less playing time. If he outperforms his true ERA, he's going to get more playing time. (The same argument would apply to position players.)

A player on a downward spiral within a season and across his career gets less and less playing time, and then unless he gets on a lucky streak he may find that there is no way to get better because he's sitting on the bench. (So he may be traded to a place where his value to the other team exceeds his value to his current team, or he may be demoted to AAA to get more playing time, etc.) And of course, in some cases sharply reduced playing time is associated with injury of some kind, which is just an extreme case of deteriorating productivity (or of unavailability to produce at least in the short run but possibly for a long time or forever--with end of career injury). But in a more typical case the deterioration could just be the result of "aging" or other factors, not least of which might be who else is on the team -- a player performing well below expectations can usually be "replaced," and of course eventually all players are replaced and their production is reduced to zero.

In any case, variance in performance has a reciprocal relationship with playing time (as well as, I would speculate, with under or overperforming "expectations," i.e., not only with the actual level of performance). I'm not quite sure what the implications are for modeling or forecasting player performance.

Posted 1:27 p.m., March 15, 2004 (#27) - tangotiger
  Wally, I'm not disagreeing with what you are saying. But, it can be avoided for what I'm talking about. If you just concentrate on the 30 starters with the most PAs over the last 3 years, and the 30 relievers with the most saves over the last 3 years, the selective sampling would be similar.

Our expectation therefore is that the spread of the forecast (any forecast) for those starters must be much smaller (at least as a group) than the spread of the forecast for those relievers.

Anyone who looks at the prob distribution spread of PECOTA (or any other forecast) must expect this. If they don't get it, then I would say that that forecast engine would have to explain itself.

Posted 1:55 p.m., March 15, 2004 (#28) - Ex-Ed
  Regarding these last two posts, I think both tango and wally are right.

Having said that, Wally does not need to remind us that nate does not need to reveal the ingredients in his special secret sauce in response to what I am about to say.

From the various PECOTA writeups, it's not clear that the playing time forecasts are estimated simultaneously with the forecasted rate stats. If they are, then one would expect to see greater forecast variance for pitchers with fewer forecasted BF. But if they are not, and the rate stats and the BF stats are forecasted independently, then multiplied together to get counting stats forecasts, then we wouldn't necessarily see the reliever/starter forecast standard error patterns that tango rightly expects to see.

Posted 2:11 p.m., March 15, 2004 (#29) - tangotiger
  I just went through the Angels pitchers, starting from the top, and stopped in the middle. I looked for pitchers with no, or almost no, MLB experience. These are guys who we must be less sure of, if only because minor league data cannot be as reliable as MLB data. And, to boot, I split them off as being minor league relievers or minor league starters.

Andra 2.04 3.66 5.59 56% 153% Reliever
Dunca 3.32 5.16 7.76 64% 150% Reliever
Jones 2.29 3.7 5.62 62% 152% Reliever
...

Bootc 3.12 5.04 7.46 62% 148% Starter
Cyr 3.71 5.31 8.74 70% 165% Starter
Fisch 3.28 4.49 6.59 73% 147% Starter
Green 3.23 4.79 6.75 67% 141% Starter
Hensl 3.08 5.02 7.84 61% 156% Starter

Notice a pattern? That's right... not much. The relievers are between 60% and 150% of their 50th percentile, while the starters are between 65-70% and 150%.

These reliever spreads match the spreads of Donnelly and Percival (see post #23). These starter spreads are pretty close to the spreads of the experienced Angels starters.

We shouldn't confuse the quality of the mean forecasts of PECOTA and the quality of its prob distribution forecasts.

Posted 3:14 p.m., March 15, 2004 (#30) - J Cross(e-mail) (homepage)
  mathteamcoach sent me rototimes and shandler projections and I stuck those in a spreadsheet with pecota, zips and marcel.

Here are the top 4 hitters PECOTA doesn't like compared to average:

1. Bill Mueller
2. Javy Lopez
3. Vlad Guerrero
4. Jose Guillen

3 of the 4 are guys who had "breakout" years well above past performance.

top 4 notable (ie everyday) players PECOTA does like:

1. Adam Dunn
2. Nick Johnson
3. Tony Batista
4. Carl Everett

2, 3 and 4 are all players acquired by the Expos in the offseason. BP has an extreme park factor for the Expos (Nate Silver mentioned this in a chat), which also explains Vlad's bad PECOTA projection. Actually, I think it's pretty amusing that Prospectus' Expos park factor is enough to push all 4 players that moved to or from the 'Spos in the offseason to an extreme projection. It's easily the most dominant trend in PECOTA projections compared to other systems.

Adam Dunn gets a very high PECOTA projection because of his extreme isolated power and the fact that isolated power is the #1 basis for finding comparables in the PECOTA system.

So, what have we learned?

pecota likes:

1. The Expos home parks
2. Isolated power

pecota doesn't like

1. players with 2003 breakouts.

I should have more to come on this, and I'm hoping it'll get more interesting when I get the pitchers in there.

Posted 4:01 p.m., March 15, 2004 (#31) - tangotiger
  Actually J, you should only compare it to the Marcels, since that's the only system where we know what it's doing.

What's the r between Marcel and the others?

Posted 4:21 p.m., March 15, 2004 (#32) - MGL
  Ballpark factors fluctuate like crazy from year to year, even indoor ones, like in Montreal. Unless you heavily regress AND use many years of ballpark data, you are asking for trouble. A ballpark factor never changes unless the park itself or the rest of the league changes. Montreal has not changed anything about their stadium since 1977 (I don't think, off the top of my head). Last year, it "played" as a tremendous hitter's park. In 2001, it was a hitter's park as well. Over the long haul, Montreal is a neutral, if not slight, pitcher's park. In fact, I use .98 as the OPS park factor in Montreal (based on 10 years of component data). To use anything but neutral park adjustments for Montreal hitters (and pitchers) is crazy (when a park "plays" as a hitter's park, there is a teenie-weenie suggestion that the hitters are "suited" for that park).

I want to reemphasize that how a park "plays" in any given year has nothing to do with how you should adjust a player's stats in that year (other than the fact that the long-term park factor is changed by the addition of that year's stats). I would suspect that Pecota has the same type of "bias" for past and present KC players, as Kauffman has "played" as an extreme hitter's park for the last 3 years, even though the long-term park factors suggest that it is only a slight hitter's park (after the most recent changes in dimensions/turf).

I think that if we examine individual Pecota projections and even biases and patterns within all the projections, we will find a lot to criticize, as any "blanket" projection engine is going to miss the boat in a lot of individual cases, unless a human being goes through each projection with a fine tooth comb (maybe Pecota does that - I don't know). When the smoke clears, however, the fact that Pecota appears to be as good or better than most of the other forecast engines says SOMETHING about what it does.

Whether Pecota's "error bands" are doing anything at all (or whether they are all screwed up) can easily be checked, I would think. If someone like Tango were to do some "generic" Q&D error bars for X number of players, based on projected playing time and amount of historical stats (the 2 major components of a "generic" error bar), we could easily compare these to Pecota's error bars for, say, the 2003 projections, by looking at the variance in actual 2003 stats. For example, if Tango has 50 players who should all have around the same variance (width of error bars), but Pecota has half of them with a much narrower band and half with a much wider band, we can see what the actual variance of 2003 performance was in these 2 groups. If there is a significant difference (in the right direction, of course), then it would suggest that Pecota is doing something really cool (and correct) with their "funky" error bands.

In general, I suspect that 95% of the reason that Pecota does well is simply because it is a good Marcel-type engine. The other 5% MAY be due to their "secret, yet powerful" "similarity thing," although I remain skeptical that it is any help at all. I remain even more skeptical that their "error bands" have any useful meaning whatsoever...
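MGL's proposed check can be mocked up with simulated data (everything below is fabricated for illustration; the real test would use actual 2003 outcomes): if the narrow-band group really is more predictable, its actual outcomes should vary less.

```python
import random
import statistics

random.seed(1)

# Simulated "actual" OBA outcomes for two groups with the same mean forecast.
# The truth is baked in here (SD .020 vs .040), so this only shows the
# mechanics of the comparison, not a real result.
narrow = [random.gauss(0.360, 0.020) for _ in range(500)]
wide = [random.gauss(0.360, 0.040) for _ in range(500)]

print(round(statistics.stdev(narrow), 3), round(statistics.stdev(wide), 3))
```

Run against real forecasts and real outcomes, a clear gap in the right direction would be evidence that the bands carry information beyond playing time.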

Posted 5:03 p.m., March 15, 2004 (#33) - tangotiger
  MGL, please remember that Nos Amours played 25% of their home games away from the Big O. Can you publish your non-regressed PF for Montreal (and San Juan), and regressed for the last few years?

Posted 5:41 p.m., March 15, 2004 (#34) - MGL
  MGL, please remember that Nos Amours played 25% of their home games away from the Big O.

Yup, I keep forgetting that...

Posted 5:49 p.m., March 15, 2004 (#35) - J Cross
  correlation with Marcel

pecota OBP: .893
pecota slg: .882

zips obp: .846
zips slg: .879

rototimes obp: .901
rototimes slg: .893

shandler obp: .864
shandler slg: .852

just as a reminder, OPS corr. with actual last year (min 300 PA):

pecota: .711
zips: .692
rototimes: .683
shandler (bbhq): .690

Posted 5:59 p.m., March 15, 2004 (#36) - J Cross
  The pecota v. marcel list looks reasonably similar to the pecota v. average list. Adam Dunn and guys with high ISO are still near the top along with Montreal players.

Posted 6:23 p.m., March 15, 2004 (#37) - Wally Moon
  If you regress PECOTA against Marcel (with Marcel as the indep. or predictor variable), then you can generate residuals and look for cases where PECOTA is +/- 1 std. dev. from the Marcel prediction. You can then, of course, add "park" or "team" to the regression equation to see how much that accounts for the remaining variance. With OBP, you are already accounting for 79.7% of the variance in PECOTA using Marcel (square of the correlation coefficient), and you are already accounting for 77.8% of the variance in SLG. Thus, you have about 20-22% "unexplained" variance in the predictions -- so how much of that error is reduced when you add "team" or "park," or, for that matter, Tango's thing here: PA.

Thanks for considering this added wrinkle.

Posted 6:25 p.m., March 15, 2004 (#38) - Wally Moon
  I meant, when adding "park" or "team," to add dummy variables for the teams (all except 1, of course, as the reference category).
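A minimal version of the residual check Wally describes, with made-up (Marcel, PECOTA) OBP pairs; team or park dummy variables would be added to the same regression to see how much residual variance they absorb:

```python
import statistics

# Hypothetical (Marcel, PECOTA) OBP forecast pairs, purely illustrative.
pairs = [(0.340, 0.345), (0.360, 0.350), (0.330, 0.332),
         (0.380, 0.395), (0.350, 0.348), (0.320, 0.315)]

xs = [m for m, p in pairs]
ys = [p for m, p in pairs]
mx, my = statistics.mean(xs), statistics.mean(ys)

# Ordinary least squares: PECOTA = a + b * Marcel.
b = sum((x - mx) * (y - my) for x, y in pairs) / sum((x - mx) ** 2 for x in xs)
a = my - b * mx

# Residuals flag the players where PECOTA departs most from Marcel.
residuals = [y - (a + b * x) for x, y in pairs]
sd = statistics.stdev(residuals)
flagged = [i for i, r in enumerate(residuals) if abs(r) > sd]
print(flagged)
```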

Posted 9:19 p.m., March 15, 2004 (#39) - Michael
  Does anyone have the PECOTA numbers for 2003? When BP added the 2004 PECOTAs, the 2003 ones went down (they reused the URLs, which don't have the year in them). Nate says 2003 is coming back, but it isn't a high priority. If you had the 2003 probability bands you could at least look at the players forecast in 2003 and see what percentage were over/under their various projections. I.e., if 24.5% of players played below the 25% numbers, that might be pretty good. If 48% of players were below the 40% numbers, that might not be as good. If the distribution of actuals mirrors the projections, then there is some confidence PECOTA is doing a good job with its bands. Especially since, as tango points out, the PECOTA bands are *not* simply saying here's what you'd expect via binomial to occur if true talent was X over Y PA.

I had planned to do just this study, but the 2003 data went away. (I wanted to know how much I should trust the 2004 bands when I was doing fantasy valuation.)
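Michael's calibration check is easy to express in code. This sketch assumes you have each player's 25th/50th/75th-percentile projections plus the actual outcome; the numbers here are invented:

```python
# Each tuple: (25th-pct projection, 50th, 75th, actual OBP).
players = [
    (0.310, 0.330, 0.350, 0.325),
    (0.290, 0.315, 0.335, 0.280),
    (0.340, 0.360, 0.380, 0.390),
    (0.300, 0.320, 0.345, 0.318),
    (0.320, 0.340, 0.365, 0.344),
    (0.280, 0.305, 0.330, 0.301),
    (0.330, 0.350, 0.375, 0.372),
    (0.295, 0.318, 0.342, 0.329),
]

# If the bands are well calibrated, about 25% of players should finish
# below their 25th-percentile projection, 50% below the median, etc.
for pct, idx in [(25, 0), (50, 1), (75, 2)]:
    below = sum(1 for p in players if p[3] < p[idx])
    print(f"below {pct}th percentile: {below}/{len(players)} "
          f"({100 * below / len(players):.0f}%)")
```

With only a few hundred players, the observed percentages will bounce around the nominal ones, so a real study would also want a binomial tolerance on each check.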

Posted 10:51 p.m., March 15, 2004 (#40) - tangotiger
  It might be worth asking Nate for a CSV file of the 2003 data, at least OBP, SLG, and PA for each player/band.

One good thing that Nate shows with the bands is that as the player's performance gets worse, he gets less playing time. That's a good thing, as that models reality. That is, if your true talent is an ERA of 4.0, and you are performing half-way through the season at a 5.5 clip, you will get less PAs the rest of the way. Even if the rest of the way you continue at your true rate of 4.0, your overall rate will be around 5.0, and your PA will be less than forecasted.

Like I said in another thread: you really need another dimension to the forecast. I would do it something like this:

90%: 3.00 ERA
75%: 3.50
50%: 4.00
25%: 4.50
10%: 5.00

Given 3.00 ERA
90%: 260 IP
75%: 230 IP
50%: 210 IP
25%: 190 IP
10%: 160 IP

Given 3.50 ERA...

And so on. Then

Given 3.00 ERA

90%: 1.0 K / IP
75%: 0.8 K / IP
etc, etc

So, there's a lot of dimensions going on here. I think Nate does a good job in presenting it as he does. But, I think that hides what's really happening with the probability distributions.

In any case, as MGL said, what does it really matter. If you have 2 guys with a forecasted ERA of 3.00, but one has an SD of 0.50 and the other has it at 0.25, do you need to apply risk aversion? Aren't they both equals? Even applying a non-linear salary to each probability distribution, I don't think you'll have much difference.
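One way to read Tango's nested tables is as a two-stage sampling scheme: draw an ERA from its percentile distribution, then draw IP conditional on that ERA. Only the ERA percentiles and the 3.00-ERA IP row come from the post; the other conditional IP rows below are invented to fill out the sketch:

```python
import random

# ERA percentile points (from the post).
era_pcts = {90: 3.00, 75: 3.50, 50: 4.00, 25: 4.50, 10: 5.00}

# IP percentiles conditional on the realized ERA. The 3.00 row is from
# the post; the rest are made-up placeholders.
ip_given_era = {
    3.00: {90: 260, 75: 230, 50: 210, 25: 190, 10: 160},
    3.50: {90: 240, 75: 215, 50: 195, 25: 175, 10: 150},
    4.00: {90: 220, 75: 200, 50: 180, 25: 160, 10: 135},
    4.50: {90: 200, 75: 180, 50: 160, 25: 140, 10: 115},
    5.00: {90: 180, 75: 160, 50: 140, 25: 120, 10: 95},
}

def draw():
    """Crude two-stage draw: pick an ERA percentile row uniformly,
    then an IP percentile row conditional on that ERA."""
    era = random.choice(list(era_pcts.values()))
    ip = random.choice(list(ip_given_era[era].values()))
    return era, ip

random.seed(0)
samples = [draw() for _ in range(10000)]

# Better ERAs should come with more innings, which is exactly the
# performance-drives-playing-time effect Tango describes.
good = [ip for era, ip in samples if era <= 3.50]
bad = [ip for era, ip in samples if era >= 4.50]
avg_ip_good = sum(good) / len(good)
avg_ip_bad = sum(bad) / len(bad)
print(f"avg IP when ERA <= 3.50: {avg_ip_good:.0f}")
print(f"avg IP when ERA >= 4.50: {avg_ip_bad:.0f}")
```

Sampling percentile rows uniformly is a rough stand-in for sampling the underlying continuous distributions, but it captures the structure: every stat after the first is a conditional distribution, not an independent one.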

Posted 11:57 p.m., March 15, 2004 (#41) - MGL
  I was actually asking the question, of what value is accurately estimating the bands? I don't see any obvious (significant) value, but surely there is some. I suppose that if you are the overwhelming favorite to win your division, you want narrow bands for your players, but I'm not even sure it matters much there. If you are DET (or TBA, et al.) with no shot at making the playoffs, it is not clear whether you want a high probability of simply improving a little (good players with narrow bands) or you want a shot at improving a lot but possibly not improving at all (average or good players with wide bands). I suppose that you could create an argument that if you have a low payroll, you are forced to go with cheaper players with wide bands and hope to get lucky. Again, I'm not real sure it makes all that much difference, and the issue here is whether Pecota is forecasting these bands any more accurately than "Marcel" (the "confidence band" version of Marcel) can do.

Surely Nate would have already tested all of these so-called innovative engines and algorithms. Wouldn't he have?

Posted 12:00 a.m., March 16, 2004 (#42) - Michael
  It totally matters. First of all, it is not the case that all players have equal and symmetric distributions. But even in a case like your example where they do, think of the following players:

Pitcher A E(ERA) = 4.00, SD(ERA) = 1
Pitcher B E(ERA) = 4.00, SD(ERA) = 0.25

Imagine that you are the Yankees picking the guy you want to be your 5th starter. Maybe you are comfortable with pitcher B because you know that with him pitching you'll almost certainly win 50+% of his games regardless. I.e., you want the less risky guy.

Now imagine that you are instead the Jays and you are trying to fill your #3 or #4 starter. You might prefer pitcher A, because in order to catch the Red Sox and Yankees you pretty much need a 1 SD better than average type luck to make the playoffs and as a result you might be risk seeking.

Posted 1:47 a.m., March 16, 2004 (#43) - MGL
  Michael, that's basically what I said. What I am not sure of is whether it is going to matter more than an iota. There simply aren't going to be a whole lot of players who have markedly different error bands. Even when there are, you would still have to calculate how much one player is going to affect the team's distribution of wins and losses. Even with the extreme example you give (an SD of 1 versus .25), I'm not sure that it would have much impact on the team's chances of winning the division one way or another. More importantly, a team usually has a choice of pitcher A who is 4.0 with an SD of 1 run or pitcher B who is 4.5 with a .25 SD. Greater confidence usually comes with a price. I doubt that you could tell me off the top of your head which pitcher is better for the Yankees or even for the Jays...

Posted 7:51 a.m., March 16, 2004 (#44) - Ex-Ed
  I raised this in an earlier pecota thread, but testing the utility of the measures would be very easy if one had all the data.

1. 2 x 2 chi-squared test of independence
improve/~improve x predicted improve/~predicted improve

2. simple regression/difference in means for collapse

dep var = % chance collapse
ind var = collapse/~collapse

3. repeat 2 for breakout.

In all three cases the null would be that having the measures in hand gives you no more information than a coin flip (where p can vary).
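Ex-Ed's test #1 is a standard 2x2 chi-squared test of independence, which can be done by hand. The counts below are made up; the real table would cross actual improvement against PECOTA's predicted improvement:

```python
# Hypothetical 2x2 contingency table:
#               predicted improve   predicted ~improve
# improved            40                  25
# ~improved           20                  35
table = [[40, 25], [20, 35]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

# Chi-squared statistic: sum over cells of (observed - expected)^2 / expected,
# where expected assumes the prediction is independent of the outcome.
chi2 = 0.0
for i in range(2):
    for j in range(2):
        expected = row_totals[i] * col_totals[j] / n
        chi2 += (table[i][j] - expected) ** 2 / expected

# With 1 degree of freedom, chi2 > 3.84 rejects independence at p < .05,
# i.e. the predictions carry more information than a coin flip.
print(f"chi-squared = {chi2:.2f}, reject at .05? {chi2 > 3.84}")
```

Tests #2 and #3 would just regress the stated collapse (or breakout) probability on the observed outcome, which is a difference-in-means check in the two-group case.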

Posted 11:49 a.m., March 16, 2004 (#45) - tangotiger
  Michael, I don't disagree with you, but you haven't specified the impact.

Posted 2:21 p.m., March 17, 2004 (#46) - J Cross
  Well, just for fun I multiplied the PECOTA "breakout%" by the "collapse%" for all players with 300 or more AB's. Then I controlled for AB's and ranked players by "brk%*clp% over expected." I didn't put much thought into this; it was just quick and dirty. Anyway, the top "wild," i.e. high surprise factor, players are:

Edgar Martinez, Rocco Baldelli, Johnny Peralta, Jose Reyes, Ramon Santiago, Frank Thomas, Jose Lopez, Carl Crawford, B.J. Upton and Matt Kata.

To me this just looks like a group of very young or very old players. Their ages: 41, 22, 22, 21, 24, 36, 20, 22, 19, 26. The average age of players projected to have 300 AB's or more was 28.6 with a stdev of 4.4, so strictly speaking only one of these players was within 1 stdev (although 24 is close).

Anyway, what surprised me was the most "reliable" players:

#1 Javy Lopez, #2 Barry Bonds

Javy Lopez!?!? Gosh, if there was a player I thought I didn't know what to expect from...

Anyway, the ages of the top 10 "reliable" players:
33, 39, 33, 37, 32, 32, 30, 28, 26, 33... a lot of 33 yr olds.

Not sure there's much to see here, but I'll keep fiddling with the data. I had to use the breakout/collapse rates b/c I don't have the percentile projections in a spreadsheet.
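J Cross's quick-and-dirty "wildness" score can be sketched as follows. The players and rates are fabricated; the real inputs are PECOTA's published breakout% and collapse% per player (and the "over expected" step, controlling for AB's, is omitted here):

```python
# Hypothetical (player, breakout%, collapse%) rows.
players = [
    ("Young Guy", 0.35, 0.30),
    ("Veteran A", 0.10, 0.12),
    ("Old Slugger", 0.08, 0.40),
    ("Steady Vet", 0.12, 0.10),
]

# "Wildness" = breakout% * collapse%: high only when PECOTA puts real
# probability mass on BOTH tails, i.e. a wide or bimodal forecast.
wild = sorted(players, key=lambda p: p[1] * p[2], reverse=True)
for name, brk, clp in wild:
    print(f"{name:12s} {brk * clp:.4f}")
```

The product rewards symmetric uncertainty, which is why very young and very old players, whose upside and downside are both large, float to the top of the real list.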

Posted 2:31 p.m., March 17, 2004 (#47) - AED
  MGL, the bands have little predictive value. Most of the error in any decent projection will be caused by randomness in the upcoming season's stats, and that randomness is Gaussian.

For example, in the sample of players with 300+ PAs, the expected rms error from randomness alone is about 0.074 in OPS. Subtracting that in quadrature from the total observed error means the projection errors themselves were only around 0.05 in OPS. And if you take any reasonable distribution of projection errors and convolve it with a Gaussian that is 50% larger, you'll end up with something that looks awfully Gaussian.

So if a player has a strongly non-Gaussian band, all it tells you is that Nate couldn't find very many comparable players.
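AED's convolution point can be demonstrated numerically. This sketch takes a deliberately non-Gaussian (uniform) projection-error distribution with SD 0.05 and adds Gaussian in-season randomness with SD 0.074, as in his OPS example:

```python
import random
import statistics

random.seed(1)
N = 100_000

# Deliberately non-Gaussian projection errors: uniform on [-a, a],
# scaled so the SD is 0.05 (uniform SD = a / sqrt(3)).
half_width = 0.05 * (3 ** 0.5)
proj = [random.uniform(-half_width, half_width) for _ in range(N)]

# Add the Gaussian randomness of the season itself (SD = 0.074),
# about 50% larger than the projection error.
total = [p + random.gauss(0, 0.074) for p in proj]

sd = statistics.pstdev(total)
within_1sd = sum(1 for x in total if abs(x) < sd) / N

# A pure uniform puts ~57.7% of its mass within 1 SD; a Gaussian puts
# ~68.3%. The convolved distribution should land near the Gaussian figure.
print(f"total SD: {sd:.4f}")
print(f"fraction within 1 SD: {within_1sd:.3f}")
```

The total SD comes out near sqrt(0.05^2 + 0.074^2) ≈ 0.089, and the shape is close to Gaussian even though the projection errors were as non-Gaussian as you like, which is why the observed band shapes carry so little information.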

Posted 2:56 p.m., March 17, 2004 (#48) - J Cross
  btw, AED, I used your basetball rankings for my bracket. Thanks.

Posted 9:25 p.m., March 17, 2004 (#49) - MGL
  btw, AED, I used your basetball rankings for my bracket. Thanks.

I didn't realize that if you change (remove) 2 letters you can go from basketball to baseball. Is "basetball" some kind of new game? Like from the movie "baseketball"?

Posted 10:33 p.m., March 17, 2004 (#50) - J Cross
  It's the new misspelling game. All the rage on primer.