The Forecasters Challenge 2009

© Tangotiger

Background

I introduced the project on my site last year:

On May 9, 2008, and again on June 5, 2008, Nate Silver, architect of Baseball Prospectus' PECOTA, and forecaster at the political site FiveThirtyEight.com, issued a challenge to fellow political forecaster Dick Bennett at American Research Group in picking the winner of the Democratic primary, state-by-state. That challenge went unanswered. Inspired by that challenge, and having done a multitude of Forecaster assessments in the past, the website TangoTiger.net issued a similar challenge to a large group of forecasters. Most accepted. With 21 forecasters and Marcel, the scene was set. The concept was simple enough:

  1. We will create a simple scoring system.
  2. Each Forecaster provides an ordered list of players.
  3. Forecasters will be drawn at random to determine draft order.
  4. We will create a program to automate the draft, snake-style, until the rosters are filled. That's the team you get, no trading or moves.
  5. We then repeat steps 3 and 4 one thousand times. (The running of so many drafts is to limit the amount of luck one good or bad pick can have on the outcome. It is unlikely that one Forecaster would have ended up with Cliff Lee one thousand times in 2008.)

Then, we'll see how everyone does at the end of the year. The leader gets to promote the fact that they won The Forecasters Challenge. This challenge is being issued in a spirit of fun and sportsmanship. Here were the Full Rules and List of Participants.

The idea is that with 1000 random drafts, the players will get spread out among the 22 fans, so that each fan will get each player roughly 25-100 times, for an average of about 45. It didn't quite work out that cleanly, and I'll talk about that in a bit.
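
The draft procedure in steps 3 and 4 can be sketched in a few lines of Python. This is a toy version: the position quotas and the scoring are omitted, and the forecaster and player names in the usage comment are placeholders.

```python
import random

def snake_draft(ranked_lists, roster_size):
    """Run one snake-style draft from each forecaster's ordered list.
    ranked_lists: {forecaster: [player, ...]}, most-preferred first.
    The position quotas from the real rules are omitted for brevity."""
    order = list(ranked_lists)
    random.shuffle(order)                      # step 3: random draft order
    taken = set()
    rosters = {f: [] for f in order}
    for rnd in range(roster_size):             # step 4: snake until rosters fill
        turn = order if rnd % 2 == 0 else order[::-1]
        for f in turn:
            # each forecaster takes their highest-ranked player still available
            pick = next(p for p in ranked_lists[f] if p not in taken)
            taken.add(pick)
            rosters[f].append(pick)
    return rosters

# step 5: repeat a thousand times, tallying the results of each draft
# results = [snake_draft(lists, 25) for _ in range(1000)]
```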

The Results

ID	POINTS	WINS	VALUE	FAN
115	1484	385	5672	John Eric Hanson
122	1423	187	3581	RotoWorld
203	1397	126	2743	MGL
101	1396	93	2403	ANONYMOUS1
102	1385	75	2220	Ask Rotoman
218	1379	59	1878	Steamer
116	1340	35	1198	KFFL
204	1298	29	974	Baseball Primer (ZiPS)
120	1269	14	544	Razzball
121	1295	0	385	RotoExperts
109	1246	2	194	Christopher Gueringer
112	1229	1	118	FantasyPros911
	1056	0	189	Bottom 10

Bottom 10, alphabetical order, ID / Name:
105	Brad Null
106	CAIRO
207	CBS (proxy) 
108	Chone
110	Cory Schwarz
111	Fantasy Scope
113	FeinSports.com
214	Hardball Times (proxy)
217	Marcel
119	PECOTA

John Eric Hanson is the official winner!

Let me explain all the columns. The first column is just a unique ID that means nothing to anyone except my computer program. Marcel is ID 217. Hanson had ID 115. The only meaningful thing out of the ID is that if it starts with a "2" then that means I had to do manipulation on the data in order to create a draft list. I'll talk about this in a second.

The second column is the average number of points each forecaster's team scored per draft. Again, these numbers are pretty meaningless without context, but I show them here anyway.

The third column is the total number of drafts each participant won. Hanson, Rotoworld, and MGL combined to win 70% of the drafts.

The fourth column is the total value of the drafts. I gave out 11 points for a win, 5 for a 2nd place finish, 3 for a 3rd, 2 for a 4th, and 1 for a 5th. Each draft had 22 points allocated (and with 22 teams, that means an average of 1 point allocated per team). Over 1000 drafts, the average team therefore collects 1000 points. Hanson earned 5672 points out of a maximum of 11,000, which is fantastic.
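
In code, the VALUE column is just a lookup over each team's finishing places. The place values below are straight from the rules; everything else is a sketch.

```python
# 11/5/3/2/1 points for finishing 1st through 5th in a single draft
PLACE_POINTS = {1: 11, 2: 5, 3: 3, 4: 2, 5: 1}

def draft_value(finishes):
    """Total VALUE for one forecaster, given their finishing place in
    each of the 1000 drafts. Finishing 6th or worse scores nothing."""
    return sum(PLACE_POINTS.get(place, 0) for place in finishes)

# 22 points are allocated per draft -- one per team on average --
# so 1000 drafts hand out 1000 points to the average team.
assert sum(PLACE_POINTS.values()) == 22
```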

A couple more explanations

You will notice that MGL has an ID of 203, which means I had to manipulate his submission. MGL provided player names, and forecast lines (S, D, T, HR, etc). Basic stuff. Whatever he did not supply, I filled in using Marcel rates. Furthermore, he used the playing time forecasts from the Community Forecasts.

Steamer is a group of high school students who provided a limited number of forecasted players. The rest was supplemented by Marcel. ZiPS has similar issues to MGL, and therefore, has similar manipulations. CBS and Hardball Times required name/ID matching, and a few may have fallen through the cracks.

Some oddities

As I noted a few months ago:

Any of the lucky breaks of one draft because of draft position or whatnot gets wiped out if I run it a thousand times.

Cool, right? I was also surmising that everyone would have such similar lists that we'd all end up drafting each player around the same number of times. I really thought that since Pujols was going to be drafted 1000 times, and there are 22 of us, then each of us would draft him at least 30 times, and no more than 100 times. I reasoned that he would be a top-five pick, and that each of us would draft from each slot 40-50 times or so, so eventually it would even out. Indeed, he was ranked #1 by 5 of us, #2 or #3 by another 5, #4 by another 6, and finally #5 or #6 by the final 5.

A funny thing happened to that theory. Of the 5 forecasters that had him ranked #1, he went to one of them 761 times out of the 1000 drafts. The 5 forecasters that had him ranked #2 or #3 got him in 157 drafts. The #4 rankers got him the other 82 times. Anyone who ranked him 5th or 6th simply never drafted Pujols in 1000 drafts.

And this repeated itself with a huge number of players. Sometimes, the rankings were so nuanced that a forecaster ended up drafting that player all 1000 times.

Let's analyze a bit, using Marcel The Monkey Forecasting System as the illustration. Marcel had Chad Billingsley, an excellent pitcher who is performing well, ranked 24th on its list. Remember, the automated draft program I wrote simply goes through everyone's list and starts drafting (while ensuring that the position quotas are filled). Several forecasters had Billingsley ranked pretty high; one had him 25th. Marcel ended up drafting Billingsley 984 times, while that other forecaster got him the other 16 times. Quite a difference for such a small gap in rank! What happened is that the other forecaster had another player ranked much higher than everyone else (Rich Harden, ranked 14th), so Harden was also always going to that forecaster, leaving Billingsley almost always exposed to Marcel.

This particular case worked out very well for Marcel. It wasn't always so. Here are the 25 players that Marcel ended up drafting the most. You can consider these guys as the guys that Marcel liked more than anyone else:

ORDER_ID MLBAM_ID PLAYER_TX
5 407812 Holliday, Matt
24 451532 Billingsley, Chad
38 449107 Aviles, Mike
59 493137 Matsuzaka, Daisuke
65 467055 Sandoval, Pablo
69 425794 Wainwright, Adam
82 425426 Wang, Chien-Ming
109 451482 Galarraga, Armando
112 446209 Litsch, Jesse
119 434578 Saunders, Joe
142 150217 Guzman, Cristian
157 433584 Carmona, Fausto
161 460051 Getz, Chris
182 458690 Volstad, Chris
207 434633 Baker, John
215 460003 Teagarden, Taylor
322 150317 Crede, Joe
332 407487 Rivera, Juan
361 333292 Wilson, Jack
425 457788 Schafer, Jordan
486 334393 Pierre, Juan
503 346795 Chavez, Endy
513 123107 Tatis, Fernando
635 110383 Aurilia, Rich

Let's look at a few of these players. You will see that Marcel thought highly of Dice-K, Chien-Ming Wang, and Fausto Carmona. Talk about bad luck. Was Marcel unusually excited by these players, or was it a case like Billingsley, where Marcel had these guys just a bit higher than everyone else and ended up with them so disproportionately because, well, that's simply the way the draft works?

Let's start with Dice-K. Marcel had him ranked #41. Here is how he was ranked by the next five forecasters highest on Dice-K: #58, 59, 65, 71, 83. So it's not like Marcel was crazy in love with him: he had him ranked a bit higher, but not unusually so. The result, though, is that Marcel ended up with Dice-K in 897 drafts out of 1000! And Dice-K's fantasy point total was one of the worst in the whole league.

In typical correlation studies, we wouldn't notice much with this pick. That is, take Dice-K completely out of the correlation study, and Marcel's correlation coefficient doesn't change much, if at all. However, because Marcel drafted him nearly 90% of the time, and because he had an enormous collapse, it is Marcel that takes almost the whole brunt here. The other forecasters who also had Dice-K ranked highly get off almost scot-free, because Marcel was there to bail them out!

Let's continue with Wang. We all know what happened with Wang. Was Marcel, who drafted him 510 times on the strength of ranking him 82nd, unusually excited about Wang? Here were the rankings of the five other forecasters most in love with Wang: #68 (he drafted Wang the other 490 times), #135, #137, #145, #168. So, yes, Marcel should get penalized here, though Marcel was partly saved by the one other forecaster who also had Wang ranked very high. Everyone else discounted Wang heavily (though of course, not heavily enough!). In drafting, however, you only have to discount a player enough that he never gets drafted by you. Whether he was ranked #135 or #494 (his lowest ranking by a forecaster), it's the same thing! As long as there is one forecaster, just one, who is in love with him enough, you will never draft him ahead of that guy.
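
That last point generalizes: once someone ranks a player well above you, your exact ranking of him stops mattering. A small sketch, where the 20-slot cutoff is a made-up number for illustration, not anything from the actual draft engine:

```python
def never_drafts(rankings, margin=20):
    """Forecasters who, in practice, will never draft this player:
    anyone trailing the field's best ranking by more than `margin`
    slots (a hypothetical cutoff -- the real drafts settle this via
    pick order, as the 510/490 Marcel split shows)."""
    best = min(rankings.values())
    return {f for f, r in rankings.items() if r - best > margin}

# Wang's rankings from the article: Marcel 82nd, the field at 68/135/137/145/168.
wang = {"Marcel": 82, "F1": 68, "F2": 135, "F3": 137, "F4": 145, "F5": 168}
# F2 through F5 never get Wang, whether they rank him 135th or 494th.
```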

Finally, Carmona. Marcel had him #157 and drafted him 452 times. There was one forecaster who had him ranked even higher, at #119, and drafted him 541 times. The rest of the forecasters were not close.

If you had to draw up a list of pitchers who collapsed, these three would likely be among the 10 biggest pitcher busts. And by bad luck, Marcel ended up with all three of them.

At the same time, Marcel got Adam Wainwright 665 times by ranking him 69th. Here's how the next five forecasters ranked him: #71, 74, 76, 81, 82. So Marcel benefits here by ranking him just slightly high enough to end up drafting him two-thirds of the time.

I find these results utterly fascinating, and I can see myself spending an inordinate amount of time studying all this, running different drafts with different scenarios. Clearly, it is not fair that those forecasters who ranked Wainwright high get no acknowledgement for it. After all, it is simply their bad luck to be in a draft with Marcel. But Marcel will not actually be in every one of their drafts: Marcel is not necessarily a representative forecaster in a random fantasy draft.

Different draft options

A long-time reader, KJOK, suggested running a new set of drafts, but this time just putting two of the 22 forecasters in a head-to-head league. And since I need to have 550 players drafted, I let them each select 275 players. I thought this was a great idea. Basically, you are on the hook for a huge set of players. And since Marcel will only be in a few of the drafts, those who like Dice-K, Wang, and Carmona will draft them a lot more than they did in the official draft. Here are those results:

ID	POINTS	WINS	FAN
112	14358	42	FantasyPros911
203	14190	38	MGL
122	14344	37	RotoWorld
204	14208	37	Baseball Primer (ZiPS)
217	14025	31	Marcel
113	13859	30	FeinSports.com
108	13751	29	Chone
105	13855	26	Brad Null
116	13784	26	KFFL
101	13822	25	ANONYMOUS1
120	13590	22	Razzball
115	13673	20	John Eric Hanson
	13025	99	Bottom 10

Bottom 10, alphabetical order, ID / Name:
102	Ask Rotoman
106	CAIRO
207	CBS (proxy) 
109	Christopher Gueringer
110	Cory Schwarz
111	Fantasy Scope
214	Hardball Times (proxy)
119	PECOTA
121	RotoExperts
218	Steamer

Now this is mighty interesting! Hanson, our official winner, is now middle-of-the-pack. MGL moves up from #3 to #2, while Rotoworld drops from #2 to #3. And FantasyPros911, middle-of-the-pack in the official draft with exactly one win in 1000 drafts, here wins all 42 of its head-to-head drafts! Marcel ends up with 31 wins and 11 losses, which is simply fantastic. I expected Marcel to finish 21-21 because, frankly, it does nothing special. But doing nothing special seems to be what you should be aiming for because, I presume, (many) others overthought their forecasts enough to let Marcel finish with a fantastic record.

Here are the players that Marcel drafted in 40 to 42 of the 42 drafts:
ORDER_ID POINTS_CT PLAYER_TX
5 101 Holliday, Matt
24 77 Billingsley, Chad
38 -2 Aviles, Mike
44 27 Nolasco, Ricky
56 20 Slowey, Kevin
65 94 Sandoval, Pablo
69 108 Wainwright, Adam
73 95 Gallardo, Yovani
82 -29 Wang, Chien-Ming
109 14 Galarraga, Armando
112 -5 Litsch, Jesse
119 9 Saunders, Joe
123 23 Maine, John
127 5 Bush, Dave
138 70 Escobar, Yunel
139 47 Buehrle, Mark
157 -6 Carmona, Fausto
161 43 Getz, Chris
162 18 Fontenot, Mike
175 15 Maholm, Paul
182 30 Volstad, Chris
198 28 Lannan, John
206 16 Shoppach, Kelly
207 37 Baker, John
215 9 Teagarden, Taylor
229 21 Looper, Braden
234 13 Marshall, Sean
276 5 Owings, Micah
296 10 Martis, Shairon
309 -5 Harrison, Matt
425 2 Schafer, Jordan
477 40 Rasmus, Colby
503 22 Chavez, Endy
511 -2 Sanchez, Gaby
635 5 Aurilia, Rich
Aviles, Wang, Carmona, and Dice-K (38 times) are still featured prominently, but now we also add Sandoval, Buehrle, and others: players who were blocked by ONE other forecaster in the official competition, but not by the other 20 in head-to-head play. This is certainly more representative of the kind of draft list Marcel is putting out.

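
For reference, KJOK's head-to-head format can be sketched as follows. This is a simplification: ties and position quotas are ignored, and the player names and point totals in the usage example are invented.

```python
from itertools import combinations

def head_to_head(lists, points, n_players=275):
    """Every pair of forecasters drafts one-on-one, alternating picks
    until each side holds n_players; each forecaster picks first once,
    so with 22 entrants everyone plays 42 drafts. points[player] is the
    season fantasy-point total; the higher roster total wins the draft."""
    wins = {f: 0 for f in lists}
    for a, b in combinations(lists, 2):
        for order in ((a, b), (b, a)):         # each side gets the first pick once
            taken, totals = set(), {a: 0, b: 0}
            for pick in range(2 * n_players):
                f = order[pick % 2]
                p = next(x for x in lists[f] if x not in taken)
                taken.add(p)
                totals[f] += points.get(p, 0)
            wins[max(totals, key=totals.get)] += 1
    return wins
```

With two forecasters whose lists differ only slightly, whoever picks first tends to win, which is why each pairing is run once in each order.
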
Conclusion

I have a solution here which makes more sense. And because I have all the draft lists, I can re-run all the drafts, changing any of the assumptions I want. This allows me to test various scenarios. Next time, I'll do a head-to-head, but I'll throw in 20 semi-intelligent random amateurs, so that everyone sticks to drafting 25 players. Basically, those 20 guys are going to pick off reasonable players from the draft pool, so that everything mimics reality. As I also explained:

I agree with Ron Shandler that we should reject the correlation studies of forecasting systems. They simply don't do what they purport to do. It's very clear what they do, but those conclusions are then taken out of their particular context, making them not applicable in the real-world sense.

The reality is the reality: which forecasting system should I follow in the draft? Playing time matters; it's a huge component. ZiPS (#4), Marcel (#5), and MGL (#2) all used the Community forecasts. Without those forecasts, Marcel would be at zero (it would have drafted Jeff Francis, among others).

So, in order to replicate the draft model, I need to put one (or two) forecasting systems in a league of "amateurs" and see how well they hold up. It doesn't really matter if Marcel sees Mauer as a $20 or $22 player if, in every draft Marcel finds himself in, there's someone who sees Mauer as a $26-29 player. It's really irrelevant whether Mauer is a $2 or a $22 player: he's completely useless to Marcel.

Basically, I'm being chickensh!t on Mauer (numbers for illustration only). It's almost like I'm saying, "You know, everyone loves Mauer, but I don't; I'm going to put his value high enough that he comes out ok in the correlation studies, but not so high that someone might actually draft him on that valuation." It's nothing more than being a coward in a fight.

This is EXACTLY what Marcel does with the rookies: Marcel knows zero about players without MLB experience. What does it do? It simply presumes league average, with minimal playing time (200 PA for batters, 60 IP for starters, or 25 IP for relievers). So, it values the player just enough that the forecast appears reasonable, but the forecast will be so low as to be irrelevant in the real world. The Community playing-time forecasts, though, turn this chickensh!t forecast into something better. If Matt Wieters is forecast for 400 PA by the Community, then Marcel is exposed a bit more.
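
Marcel's rookie default, as described above, amounts to something like this. The code is a paraphrase of the rule in the text; the dictionary structure of the returned forecast is invented for illustration.

```python
def rookie_forecast(role, league_avg_rates):
    """Marcel's default for a player with no MLB experience: league-average
    rates at minimal playing time -- 200 PA for a batter, 60 IP for a
    starter, 25 IP for a reliever. The forecast looks reasonable, but is
    too small to matter in a draft unless the Community playing-time
    forecast bumps the total up."""
    playing_time = {"batter": 200, "starter": 60, "reliever": 25}[role]
    return {"playing_time": playing_time, "rates": league_avg_rates}
```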

Therefore, I would say that the only way to evaluate a forecasting system is to actually run it through a model. Not necessarily my models, but something better than correlation studies, which impose playing-time limits for inclusion in the sample and make no effort to separate the chickensh!t forecasts from the real ones.

If there's a real winner here, it's the Community. As I noted when I ran that survey:

Who knows more about whether a pitcher will be in the starting rotation or the bullpen: an algorithm or a true fan? Who knows more about the number of games an injured ARod will play in 2009: an algorithm or a Yankees fan? There are certain human observation elements that are critical for forecasting.

Anyway, what I will do is treat this as a learning experience. I'm going to let the official rules stand in terms of awarding a winner, but I'm going to run modified versions unofficially. This will be my pet project in the off-season. Stay tuned...