The Forecasters Challenge 2009

Pros v Joes

© Tangotiger

Background

As I noted in announcing the winner:

So, in order to replicate the draft model, I need to put in one (or two) forecasting systems in a league of “amateurs”, and see how well they hold up. It doesn’t really matter if Marcel sees Mauer as a 20 or 22$ player, if in every draft Marcel finds himself in, there’s someone who sees Mauer as a 26-29$ player. It’s really irrelevant therefore if Mauer is a 2$ or 22$ player: he’s completely useless to Marcel.
And so I did replicate the draft model. I ran three different tests, the first two of which I rejected, for reasons you will see. The third one is the keeper.

Setting up the test

In every test, I pitted one Pro against 21 Joes. In draft #1, I'd select my first Pro, say Marcel, and put him up against 21 Joes, whom I'll call Joe #1 through Joe #21. Marcel picked first, and then I went through it snake-style. In draft #2, I replaced Marcel with another Pro, say PECOTA. I kept the same 21 Joes and the same draft order. I did this for all 22 Pros. That gave me 22 drafts. Then, in draft #23, I had Marcel pick second, and he came up against 21 new Joes, say Joe #22 through Joe #42. And I continued on and on. In the end, I had 484 drafts, with each of the 22 Pros participating in 22 drafts.
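
To make the rotation concrete, here is a minimal sketch of how the 484 draft configurations could be generated, assuming 22 Pros, 22 teams per draft, snake-style pick order, and a fresh group of 21 Joes for each draft seat. The function and variable names are my own illustration, not the actual program.

```python
from itertools import count

PROS = [f"Pro #{i}" for i in range(1, 23)]   # placeholder names for the 22 Pros
NUM_JOES = 21                                # Joes filling the other 21 seats
NUM_TEAMS = 1 + NUM_JOES                     # 22 teams per draft

def snake_order(num_teams, num_rounds):
    """Yield the seat index that is on the clock at each pick of a snake draft."""
    for rnd in range(num_rounds):
        seats = range(num_teams)
        yield from (seats if rnd % 2 == 0 else reversed(seats))

def draft_schedule(pros=PROS, num_joes=NUM_JOES):
    """Yield (pro, pro_seat, joes) for all 22 x 22 = 484 drafts.

    For each of the 22 draft seats, every Pro faces the same fresh
    group of 21 Joes; a new group of Joes comes in when the seat changes
    (Joe #1-#21 for seat 1, Joe #22-#42 for seat 2, and so on).
    """
    joe_ids = count(1)
    for seat in range(len(pros)):                 # Pro picks 1st, 2nd, ..., 22nd
        joes = [f"Joe #{next(joe_ids)}" for _ in range(num_joes)]
        for pro in pros:                          # each of the 22 Pros at this seat
            yield pro, seat, joes
```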

The only draft lists I have are the ordinal rankings of the 22 Pros. I also needed to create a reasonable set of common fan rankings to represent the Joes. The way I proceeded was to first construct the consensus rankings from the Pros; this became my master list. From the master list, I converted each of the rankings into dollar values. Then, for each player, I chose a random number from -5 to +5, added it to the dollar value from the master list, and reordered the list by the new dollar values. This became my reasonable list for Joe #1. I repeated the process, starting again from the master list and choosing a new random number for each player. That became the list for Joe #2.

The purpose here is to juggle the master list somewhat, so that everyone has a unique, yet reasonable, list. For example, if the consensus value for Albert Pujols was $35, and it was $28 for Nick Markakis, then the random numbers might assign a value of $31 to Pujols and $32 to Markakis, thereby pushing Markakis ahead of Pujols in Joe #something's list. The idea here is to create a typical set of Joes that you would face in a real draft, for the Pro to come up against.
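
In code, that procedure might look like this minimal sketch, assuming the -5 to +5 noise is drawn uniformly (the original could just as well have used whole dollars) and that the master list is a set of (player, dollar value) pairs. The names are illustrative only.

```python
import random

def make_joe_list(master_list, jitter=5):
    """Build one Joe's draft list from the consensus master list.

    master_list: list of (player, consensus_dollar_value) pairs.
    Each player's value gets a random bump in [-jitter, +jitter],
    and the list is re-sorted by the perturbed values.
    """
    perturbed = [(player, value + random.uniform(-jitter, jitter))
                 for player, value in master_list]
    return sorted(perturbed, key=lambda pv: pv[1], reverse=True)

# A $35 Pujols and a $28 Markakis can flip order for a given Joe:
master = [("Albert Pujols", 35), ("Nick Markakis", 28)]
joe_1 = make_joe_list(master)
joe_2 = make_joe_list(master)   # a different, but still reasonable, list
```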

Test 1

And this was a disaster. With 484 drafts, and the Pros comprising 1/22 of the participants, we'd expect the Pros to end up winning 22 times total if they were merely as good as the Joes. Since we think they are much better, they should win far more. They won only 15 times, with 11 of those wins coming from our champion (Eric Hanson) and runner-up (Rotoworld). Eighteen of the Pros never won a single one of the 22 drafts they were each involved in.

Basically, when you have one Pro against 21 Joes, even if 20 of the Joes have a ridiculous draft order, as long as 1 Joe closely matches the consensus, he would beat the Pro. And beat the Pros he did.

Test 2

Seeing that the consensus-based 21 Joes were still too smart, even though I randomly shuffled the list somewhat reasonably, I had to throw in another wrinkle. I decided that, for each draft, I would throw away 10% of the player pool from the Joes, but keep it for the Pros. So, 10% of the time, Pujols was not anywhere on the 21 Joes' lists. This opened up the possibility that when the Pros drafted late in the first round, they would get Pujols. Or, if they drafted early and had Hanley Ramirez ranked high, they could still get Pujols with their late round-2 pick. I did this for every player: 10% of the time, Hanley Ramirez was not on the Joes' lists. And so on.
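
Here is a sketch of that wrinkle, under the assumption that the drop is decided once per draft and applied to all 21 Joes at once, while the Pro's list is untouched. The same code covers Test 3 below by changing the drop rate; the function names are mine.

```python
import random

def hidden_players(player_pool, drop_rate=0.10):
    """Decide, for one draft, which players are missing from ALL Joes' lists.

    Each player is independently dropped with probability drop_rate
    (10% here in Test 2; 5% in Test 3). The Pro keeps the full pool.
    """
    return {p for p in player_pool if random.random() < drop_rate}

def thin_joe_list(joe_list, hidden):
    """Strip the hidden players from one Joe's (player, value) rankings."""
    return [(player, value) for player, value in joe_list
            if player not in hidden]
```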

This too was a disaster, but in the other direction. This time, the Pros won virtually every draft, 453 of the 484. Even the worst of the 22 Pros won 17 of the 22 drafts he was involved in. In my effort to construct a reasonable set of Joes for the Pro to face, I had tipped the balance too far the other way.

Test 3

I did the same as in Test 2, but I threw out 5%, instead of 10%, of the players in each draft for the Joes. So, instead of the Pros being guaranteed Albert Pujols at least 10% of the time, they only got that guarantee 5% of the time. This time, the results were much more reasonable.

The Pros won 155 of the 484 drafts, or 32%. This seems to be a reasonable number, and one large enough that we can try to separate the big Pros from the little Pros. Remember what we are after: which Pro does the best against a reasonable set of Joes. With 155 wins spread across 22 Pros, or about 7 wins each, we can see which Pro was able to stand out from the rest of the Pro pack.

ID	POINTS	WINS	VALUE	PROS
122	1603	15	182	Rotoworld
218	1579	13	162	Steamer
102	1547	11	160	Ask Rotoman
101	1477	10	138	ANONYMOUS1
115	1509	10	136	John Eric Hanson
121	1495	10	126	RotoExperts
109	1478	9	124	Christopher Gueringer
112	1477	8	116	FantasyPros911
204	1507	8	113	Baseball Primer (ZiPS)
116	1439	7	103	KFFL
203	1498	7	101	MGL
113	1434	6	93	FeinSports.com

Bottom 10, alphabetical order, ID / Name:
105	Brad Null
106	CAIRO
207	CBS (proxy)
108	Chone
110	Cory Schwarz
111	Fantasy Scope
214	Hardball Times (proxy)
217	Marcel
119	PECOTA
120	Razzball

To explain the first row: Rotoworld had an internal ID of 122 (means nothing other than to my computer program), averaged 1603 player points, won 15 of the 22 drafts against the Joes, and earned 182 place-points (11 for 1st, 5 for 2nd, 3 for 3rd, 2 for 4th, 1 for 5th).
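
As a quick sketch of that scoring (my own illustration, not the original program), the place-points for a Pro's 22 drafts are just the sum of the 11/5/3/2/1 schedule:

```python
# Place-points as described above: 11 for 1st, 5 for 2nd, 3 for 3rd, 2 for 4th, 1 for 5th.
PLACE_POINTS = {1: 11, 2: 5, 3: 3, 4: 2, 5: 1}

def place_points(finishes):
    """Sum place-points over a Pro's 22 finishing positions (1 = won the draft)."""
    return sum(PLACE_POINTS.get(pos, 0) for pos in finishes)

# Rotoworld's 15 wins contribute 15 * 11 = 165 points; its 2nd-through-5th
# place finishes make up the other 17 of its 182.
```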

Steamer, which, you will remember from last time, is a group of high school students who only bothered to supply an ordinal ranking of 480 players, finished second. It should be noted that Steamer supplied the rate stats, while the playing time forecasts were taken from the Community. And the rest of its rankings were rounded out by Marcel; that is, once Steamer's draft list was exhausted, I went to Marcel's draft list.
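
A minimal sketch of that fallback, assuming each draft list is simply an ordered list of player names (the function name is mine):

```python
def extend_with_fallback(primary, fallback):
    """Round out a short draft list (e.g. Steamer's 480 players) with a
    longer one (e.g. Marcel's), skipping players already ranked."""
    seen = set(primary)
    return list(primary) + [p for p in fallback if p not in seen]
```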

Conclusion

If you remember from when I announced the official winner, Rotoworld finished second behind Hanson. Under the official rules, I put the 22 Pros in 1000 drafts, competing against each other. In the same article, I also ran a Pro v Pro draft (just 2 Pros in a draft, no Joes), and Rotoworld finished 3rd. And in this final test, Pro v 21 Joes, Rotoworld finished first. This last test, to me, is the most satisfying in terms of modeling reality: it's one Pro against 21 Joes, which mimics a real draft. However, the head-to-head results from the previous article were also very powerful, since you get credit for your entire draft list. So, in terms of learning which forecasting system, top-to-bottom, has the best rankings, the head-to-head results are the ones to bank on. But this last test, the Pro v 21 Joes, is where a Pro can put his cojones on the line for certain players: if he REALLY wants to get someone, while competing with 21 Joes, he's going to have to make a bold pick every now and then.

This is best exemplified with Marcel. In the head-to-head, Pro v Pro, Marcel won an astounding 31 drafts and lost only 11. Top-to-bottom, its draft list was reasonable, with no crazy picks. Basically, a very chickensh!t draft order. But, when faced with 21 Joes, Marcel did not have any "out there" pick, no player that it really wanted no matter what. For the very reason that Marcel finished so well in the head-to-head competition (regression was its friend), Marcel was an afterthought when put into a real-world competition (5 wins out of 22 drafts, and ranked 18th out of the 22 Pros). Regression was its enemy.

Steamer, on the other hand, had the opposite problem from Marcel. Even though it used the identical Community Playing Time Forecasts, Steamer finished near the bottom in the head-to-head, while finishing near the top when faced with 21 Joes. Basically, it was able to identify a few "out there" picks, which served it well against the Joes, but it was no match when the entire draft order was looked at.

So, when you see the countless forecast evaluations out there, remember that it's possible to finish very high or very low in the evaluations based completely on the evaluation test. What you really want is to evaluate the systems the way you think they need to be evaluated, based on how you need the forecasting system to perform. What if, for example, you come to the table with your own rankings, then consult another forecasting system and use it to bump your rankings up or down a bit? Then what? I don't know. You need to model it. And that's the point here. Anyone who makes a claim that their system is the best needs to qualify it, as I have done here. Rotoworld, for one, can claim under three distinct tests (all 22 Pros together, Pro v Pro head-to-head, and Pro v 21 Joes) that it came out very well. But perhaps it would not do so well under some other set of conditions. And perhaps those conditions are what you, as an individual, are really interested in.