Tango on Baseball Archives

© Tangotiger

Archive List

More Help Requested (March 4, 2004)

Those guys who deal with the handling of sample data may enjoy what I'm about to ask. Or at least appreciate it. I'm in a position now where I want to be able to discard certain ballots from the Fan's Scouting Report. Here's what I'm thinking, and maybe the pros here can help me out a bit.

Let's start with Bernie Williams' throwing arm. I know it's terrible, you know it's terrible. Everybody knows it's terrible. Of the 40 ballots that came in, 30 or so agree with me, 7 think it's "fair", 2 think it's "average", and 1 thinks it's "great".

Now, when I pull up the ballot of the person that has Bernie as great, I see that this person marked Bernie as great in everything. And, looking at the rest of this person's ballot, I can see that it's a pile of junk. That's an easy one to throw away. But those people who said "average" certainly look like they did it ok, based on the rest of their ballots. They really believe he has an average throwing arm.

What I'm thinking is that if I start off with the overall average for that player (say Bernie was 1.3), then I put a mark on all ballots that have a rating that is more than 2 points away. So, if you put Bernie as a 4 or 5, you get a mark. I do this for all 7 characteristics and for all 499 players.

Anyone who has more than 3 marks on their ballot, I should just toss away.
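For illustration, a minimal sketch of that flagging rule in Python; only the 2-point distance and 3-mark cutoff come from the post, while the (fan, player, trait, rating) layout and the names are hypothetical.

```python
# Sketch of the flagging rule described above. Only the 2-point distance
# and the 3-mark cutoff come from the post; the (fan, player, trait, rating)
# tuple layout is just one hypothetical way to hold the ballots.
from collections import defaultdict

def junk_fans(votes, distance=2, max_marks=3):
    # group average for each (player, trait) across all ballots
    totals = defaultdict(lambda: [0.0, 0])
    for fan, player, trait, rating in votes:
        totals[(player, trait)][0] += rating
        totals[(player, trait)][1] += 1
    avg = {key: s / n for key, (s, n) in totals.items()}

    # mark every selection more than `distance` points from the group average
    marks = defaultdict(int)
    for fan, player, trait, rating in votes:
        if abs(rating - avg[(player, trait)]) > distance:
            marks[fan] += 1

    # fans with more than `max_marks` marks are candidates for tossing
    return sorted(fan for fan, n in marks.items() if n > max_marks)
```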

What do you guys think of my approach, and do you have a better way to check for junk ballots?

Thanks...


--posted by TangoTiger at 05:19 PM EDT


Posted 8:44 p.m., March 4, 2004 (#1) - Nod Narb
  When I worked in a cognitive psych lab investigating reaction time to visual stimuli, we had a program that automatically discarded all responses that were >3 SD from the mean. Maybe you'll want to parse out data in the same way. As long as your sample size is big enough for a given player.
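A sketch of that kind of SD cutoff applied to one player/trait's ratings; the function name and the k=3 default are illustrative only.

```python
# Sketch of the ">3 SD from the mean" discard described above, applied to
# one player/trait's ratings.
import statistics

def drop_extreme(ratings, k=3):
    if len(ratings) < 2:
        return list(ratings)
    mean = statistics.mean(ratings)
    sd = statistics.stdev(ratings)
    if sd == 0:
        return list(ratings)
    return [r for r in ratings if abs(r - mean) <= k * sd]
```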

Posted 9:42 p.m., March 4, 2004 (#2) - Alan Jordan
  What you are seeing is called a halo effect. That is, ratings for different traits are correlated based on a general emotional preference. It's common in political polls and marketing studies (people who liked Clinton were more likely to rate him as trustworthy, and people who liked Quayle (sp?) were more likely to rate him as intelligent).
It's considered an almost intractable problem. The guy in question would probably have Bernie's children if it were biologically possible.

Here are two other points of view on data removal.

1. Don't throw any away. You're asking for opinions, and even uninformed opinions have some merit. If they don't, then you need to survey only experts or people with minimum competency. Also, some people might rate all players very low/high. These people's data might be removed even if their rankings correlate well with the total. Trimming data (deleting cases beyond a cutoff) and winsorizing (changing cases beyond a cutoff to the cutoff itself) can cause their own problems if not done right. They reduce sample size and response variance. You can be removing signal as well as noise by removing cases.

2. Remove cases where there is no variance within player ratings. One rule might be: if sum(std(player ratings)) <= C, then delete, where C is some constant, possibly 0. If they show very little variance between players and very little variance within players, then they are basically adding a constant to all ratings. In effect they have removed themselves.
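A minimal sketch of rule #2, assuming a hypothetical `ballot` dict that maps each player the fan rated to that fan's 7 trait ratings.

```python
# Sketch of rule #2: flag a ballot whose ratings show (almost) no variance
# within any player. C is whatever constant you're comfortable with.
import statistics

def adds_nothing(ballot, C=0.0):
    total_sd = sum(statistics.stdev(ratings) for ratings in ballot.values())
    return total_sd <= C  # True -> candidate for removal
```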

Whatever you decide to do, you should analyze the data with and without the change. If there are differences, then note them even if you focus your report on one method.

Posted 11:53 a.m., March 5, 2004 (#3) - tangotiger
  Actually Alan, I specifically asked only for honest opinions of those people who watched at least 20 games of the player. Here's the full ballot in question, with the person's email removed. (To that person: thank you so much for adding a few hours of work to my plate to look for other junk ballots.)

Coomer, Ron 1 1 1 2 1 2 2
Giambi, Jason 1 1 1 1 1 1 1
Jeter, Derek 4 5 5 5 5 5 5
Johnson, Nick 2 2 2 2 2 2 2
Matsui, Hideki 2 2 2 2 2 2 2
Mondesi, Raul 1 1 1 1 1 2 2
Posada, Jorge 5 4 4 4 4 4 4
Soriano, Alfonso 3 3 3 3 3 3 3
Spencer, Shane 3 3 3 3 3 3 3
Ventura, Robin 3 3 3 3 3 3 3
White, Rondell 1 1 1 1 1 1 1
Williams, Bernie 5 5 5 5 5 5 5

That to me just reads as junk. While Shane Spencer's line is defensible, there are enough junk picks in there that I have no problem believing that this person was not being honest.

You can easily flag several picks in there as being indefensible. Bernie's arm is nowhere near comparable to Vlad's. It's comparable to mine.

Your point about analyzing the data with and without the changes is a good one. I can almost see 3 types of ballots: (1) honest, (2) questionable, but maybe defensible, (3) lies.

I have no problem taking the above ballot as a lie, and removing it as if it didn't exist. I'm hoping that I won't have many, if any, ballots in case #2. If I do, then I would follow your suggestion.

Thanks for the thoughts...

Posted 11:55 a.m., March 5, 2004 (#4) - bob mong (homepage)
  I agree with Alan's last sentence. I think it is the most important point to consider.

Whether or not you throw away any ballots, you should perform your analysis on the entire sample AND your modified sample, and note any differences.

For my job, I read a lot of articles from medical journals, almost all of which are the results of a clinical trial. Almost without exception, every single article analyzes the results using what they call "intent-to-treat analysis." This means that, whether a particular patient actually received the treatment they were supposed to, or any treatment at all, all patients who were initially included in the clinical trial are analyzed as if they had received the treatment they were supposed to receive.

Now, this isn't directly analogous, since if you don't do that in clinical trials you can get some serious selection issues (i.e., maybe 90% of the patients who didn't end up undergoing any treatment did so because they were too ill to do so), and a similar problem doesn't exist for your sample. But, all the same, I think it is best to use the sample you have collected for your analysis. One bad/goofy datapoint out of thirty won't affect any end-analysis too much, I wouldn't think.

Posted 12:32 p.m., March 5, 2004 (#5) - Patriot
  That person didn't even try. Everybody gets a string of the same # in each category for the most part.

Posted 1:02 p.m., March 5, 2004 (#6) - J Cross
  Yeah, I think you make a good case for removing this ballot. I think it would be harder to call range or even hands ratings "wrong", but arm strength is right there for everyone to see, and Bernie doesn't have it. In general, have arm strength ratings been less variable than other ratings?

Posted 1:07 p.m., March 5, 2004 (#7) - tangotiger
  Variability: I'll have to check that out.

One bad/goofy datapoint out of thirty won't affect any end-analysis too much, I wouldn't think

See, that's the problem. Take out the "5" and the SD of the remaining ballots on Bernie's arm is a .5. Include the junk ballot, and the SD is .8. The lower the SD, the more agreement there is on a trait. But, there's as much agreement on Bernie's arm as there is on all his other traits (using the junk ballot). Remove the junk, and the agreement soars.
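A rough check of that effect, using the vote breakdown from the original post (30 ones, 7 twos, 2 threes, 1 five) for Bernie's arm:

```python
import statistics

arm = [1] * 30 + [2] * 7 + [3] * 2 + [5]
print(round(statistics.stdev(arm), 2))        # ~0.81 with the junk ballot
print(round(statistics.stdev(arm[:-1]), 2))   # ~0.56 without it
```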

In this case, this was 1 out of 40 ballots for Bernie. Imagine those players with only 10 ballots.

I'm hoping that this yahoo is the only person that has this kind of an effect. We'll see...

Posted 2:14 p.m., March 5, 2004 (#8) - Alan Jordan
  If each row has the same score for all but one rating, and that rating differs by only one from the rest, as in the following 2 examples:
1 1 1 1 1 1 2
5 5 5 5 5 5 4

then the variance for that row is .143. That is incredibly low variance. The average row variance for this guy is .067, which is well below .143. Changing sum(std(player ratings))<=C to mean(std(player ratings))<=.143 (since people rated different numbers of players), this person gets cut.
You can make the C any value that you feel comfortable with (.143 is really, really low). That way you have an easily coded, objective measure of crap responses. You can also add other criteria with "or" statements if you want. Using Bernie's arm rating itself as a criterion for removal tends to make the ratings look like yours.
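The quoted numbers can be reproduced if "variance" means sample variance (n-1 denominator); a quick check against the ballot posted in #3:

```python
import statistics

print(round(statistics.variance([1, 1, 1, 1, 1, 1, 2]), 3))   # 0.143

ballot = [
    [1, 1, 1, 2, 1, 2, 2],  # Coomer
    [1, 1, 1, 1, 1, 1, 1],  # Giambi
    [4, 5, 5, 5, 5, 5, 5],  # Jeter
    [2, 2, 2, 2, 2, 2, 2],  # Johnson
    [2, 2, 2, 2, 2, 2, 2],  # Matsui
    [1, 1, 1, 1, 1, 2, 2],  # Mondesi
    [5, 4, 4, 4, 4, 4, 4],  # Posada
    [3, 3, 3, 3, 3, 3, 3],  # Soriano
    [3, 3, 3, 3, 3, 3, 3],  # Spencer
    [3, 3, 3, 3, 3, 3, 3],  # Ventura
    [1, 1, 1, 1, 1, 1, 1],  # White
    [5, 5, 5, 5, 5, 5, 5],  # Williams
]
rows = [statistics.variance(r) for r in ballot]
print(round(statistics.mean(rows), 3))                        # 0.067
```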

Posted 2:25 p.m., March 5, 2004 (#9) - tangotiger
  I only used Bernie as an example, because it was so obvious.

The problem with what you are saying is for cases like Ichiro or Beltran or Shane Spencer or Jeremy Giambi, where plenty of ballots have put strings of 5 or 4 or 3 or 1... i.e., they are even in their talent traits across the board. Agreed that seeing that many strings on one ballot would be tough, but then again, some people only fill out a few names on a ballot. So, if someone only picks Cameron and Ichiro, I shouldn't flag that ballot (or if I flag it, I will accept it upon investigation).

My current preference is to use:
- ballot minus group average
- if ballot is > 2, flag
- if # of flags > 3, investigate to delete

However, thinking further, a lot of the junk balloters are really lazy, and they would likely have used the same value across the board.

I'll have to think about it some more.

Posted 3:02 p.m., March 5, 2004 (#10) - tangotiger
  Using my flagging method from my last post, the person who put in that Bernie ballot was flagged on 11 entries out of 84 selections on the ballot. I found another one with 38 entries out of 76 selections, someone calling themselves a Sox fan, marking practically all their players as 1.

Here are the biggest offenders:
FanID n junk junkPercent
88 76 38 50
389 28 10 36
248 63 11 17
241 84 11 13 .... that's the Bernie fan
362 105 9 9

I've got over 400 ballots, so that's comforting. I'll check out the other ballots, but I'm sure they're of the same variety.

I should have implemented a registration system.

Posted 3:08 p.m., March 5, 2004 (#11) - tangotiger
  Using Alan's method of SD, fans 88, 389, and 241 are in the top 4.

Fans 248 and 362, which I've flagged as junk, have a normal SD. It's almost like they tried to make their ballots look reasonable (like the exact opposite of what they really felt). I'll check out those ballots.

Posted 3:14 p.m., March 5, 2004 (#12) - tangotiger
  In reply to J Cross, here are the averages of the SD (take each player's SD by tool, and average that):

Instincts: .82
First Step: .80
Speed: .70
Hands: .75
Footwork: .77
Arm Strength: .75
Arm Accuracy: .78

So, there's not as much agreement as you might expect.

Posted 3:21 p.m., March 5, 2004 (#13) - J Cross
  wow, I would have thought there would be a LOT less agreement on first step (which you can't see on TV) than on speed or arm strength. huh.

Posted 3:31 p.m., March 5, 2004 (#14) - tangotiger
  It might be a sign that the fans listen to the same announcers and analysts too.

Posted 3:50 p.m., March 5, 2004 (#15) - Joe(e-mail)
  I work in market research and we run across this (the Bernie Williams effect) quite a bit. Obviously the solution is to have a large enough sample size (N) so that no one respondent skews the data enough to make any difference. The only interviews we would not use are ones that don't get completed for technical reasons, or where the respondent declines to continue. It is your survey and you can conduct it how you wish, but if you do decide to exclude certain respondents you will drive yourself nuts, a) trying to decide which ones to exclude and b) finding the time to pull them out. Bob Mong made a good point: especially if you do decide to exclude some of the respondents, analyze the entire group and the group minus the exclusions, and then run some type of t-test to see if there is a significant difference between the two groups. It depends on how much time you want/have to spend on the project.

Posted 3:58 p.m., March 5, 2004 (#16) - FJM
  Can you create a correlation matrix for each player using all the raters and their ratings of the 7 skills? I'm guessing the correlations are going to be very high across the board, even though the skills are theoretically independent of each other. In other words, if a player is perceived to be better (or worse) than average by a fan, he'll probably be viewed that way across the board.

Posted 4:02 p.m., March 5, 2004 (#17) - tangotiger
  I don't have much time for anything (but that's rarely stopped me, unfortunately), and the SDs are severely affected even with only 1 junk responder, as noted earlier.

If I look at that ballot in post #3, I can't just leave it in. I realize the question becomes "where do I draw the line". But that ballot, and another ballot by a purported Sox fan listing all players as 1 in all traits? Between those and the defensible ones, that's a gulf.

I'm also thinking that showing the results pre and post junking is going to be a time-waster. I don't see how anyone would ask me for the results with that junk ballot put back in.

I think what I will do is list all ballots that I consider junk. If I get a couple of you to say "hmmm, that ballot seems weird, but I can live with it", then I'll put it back in.

Posted 4:12 p.m., March 5, 2004 (#18) - tangotiger
  Correlation (r) between Speed and...

Instincts: .41
First Step: .78
Hands: .33
Footwork: .31
Arm Strength: .18
Arm Accuracy: .23

All of them: .84
All of them (Except First Step): .41
All of them (Except First Step and Instincts): .34
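For illustration, one way correlations like these could be computed; the pandas DataFrame and the column names are assumptions, not the actual data layout.

```python
import pandas as pd

TOOLS = ["Instincts", "FirstStep", "Speed", "Hands",
         "Footwork", "ArmStrength", "ArmAccuracy"]

def speed_correlations(ratings: pd.DataFrame) -> pd.Series:
    # r between Speed and each other tool, taken from the full 7x7 matrix
    return ratings[TOOLS].corr()["Speed"].drop("Speed")

def speed_vs_rest(ratings: pd.DataFrame, exclude=()) -> float:
    # "All of them": Speed vs. the average of the remaining tools
    others = [t for t in TOOLS if t != "Speed" and t not in exclude]
    return ratings["Speed"].corr(ratings[others].mean(axis=1))
```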

Posted 4:39 p.m., March 5, 2004 (#19) - Alan Jordan
  Tango,
do you need to analyze anything other than means and standard deviations for this project? Are you planning on doing regressions or ANOVAs with this data? If not, you can treat the morons and the halo effect as random error that cancels itself out in large sample sizes.

Posted 5:05 p.m., March 5, 2004 (#20) - tangotiger
  Means, SD, sim scores, regressions with UZR all at the player/Tools, player and position/tools level.

The SD won't cancel out enough with Bernie. His SD is either .5 or .8 for throwing arm. What I'm going to present is a "level of agreement", and show it either as the SD (.5) or convert that into some sort of percentage, like 75% or something. A .8 is the same SD as for an average player's trait. And one thing is for certain, if it wasn't for that moron, almost everyone agrees that Bernie is either a 1 or 2 for arm.

Maybe we can ignore that person's ballot without throwing it out. Say we have 40 Bernie/arm ballots, and it shows:
1 - 30
2 - 7
3 - 2
4 - 0
5 - 1

The group mean is 1.375, and the mode is 1.

So, 75% agree that he's a 1. 92.5% agree that he's a 1 or 2. 97.5% agree that he's a 1, 2, or 3. 97.5% as a 1, 2, 3, or 4.

In terms of "level of agreement", what if I weight the first number as "4", the second as "3", the third as "2", and the fourth as "1". This will give me a level of agreement of: 87%.

If I threw out the junk ballot, I'd get: 89%. So, it's almost not worth throwing out.

I think this better expresses how the ballots look, than what the SD tries to do.
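A sketch of that weighting, using the Bernie-arm vote counts above; it assumes the mode sits at rating 1, as it does in this example.

```python
# "Level of agreement": cumulative shares within 0, 1, 2 and 3 points of
# the mode, weighted 4/3/2/1. counts = votes for ratings 1 through 5.
def agreement(counts):
    n = sum(counts)
    cum = [sum(counts[:k]) / n for k in range(1, 5)]
    weights = [4, 3, 2, 1]
    return sum(w * c for w, c in zip(weights, cum)) / sum(weights)

print(round(agreement([30, 7, 2, 0, 1]) * 100))   # 87, with the junk ballot
print(round(agreement([30, 7, 2, 0, 0]) * 100))   # 89, without it
```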

Thoughts?

Posted 5:56 p.m., March 5, 2004 (#21) - Alan Jordan
  "So, 75% agree that he's a 1. 92.5% agree that he's a 1 or 2. 97.5% agree that he's a 1,2,3. 97.5 as 1,2,3,4."

As long as the data is unimodal (only one peak) then you can use the percentage at the mode as your measure of central tendency and the percentage within 1 as a confidence interval. If it's bimodal then you have a problem.

"In terms of "level of agreement", what if I weight the first number as "4", the second as "3", the third as "2", and the fourth as "1". This will give me a level of agreement of: 87%."

I'm not sure what this would mean to anyone but you. Also, the level of agreement would be higher if 3 were the mode and 1 & 5 were only two away instead of 4. You could scale for that, but it then becomes even harder to explain.

Posted 10:22 a.m., March 8, 2004 (#22) - tangotiger
  Alan, I agree that this would only make sense to me. Showing something like "75% of respondents agreed on a 3 or 4" is much clearer.

Posted 2:35 p.m., March 11, 2004 (#23) - MGL
  FWIW, I'll throw in a non-technical comment about possible "junk ballots."

You have to be really careful about throwing away a ballot or portion of a ballot that is an outlier in terms of the rest of the ballots or in terms of what you think you know about a particular player, for obvious reasons. The whole point of the project is to get as many subjective evaluations as possible in order to augment the objective data AND what we think we know about a player's skills. Throwing out a ballot that doesn't "look right" is similar to the notion of not regressing Bonds' stats towards a mean because we just "know" that he is indeed a great hitter.

Now, given that, the only criteria for throwing or not throwing out a ballot should be whether the person submitting it was "honest" or not in his evaluations, and/or whether they had some minimum level of competence. Obviously, you don't know for sure about either of these criteria. Your "goal" should be in ascertaining honesty and competency, but again, you have to be really careful about judging them (honesty and competency) by comparing a person's evaluations with those of others and what you think you know about a player or players.

If it were me, I would do two things in order to determine whether a ballot or portion of a ballot is so questionable regarding honesty and competence that it merits tossing: One, if the "pattern" of the evaluations "looks" artificial, I would consider throwing it away, even if the actual ratings look reasonable. Two, if enough of the ratings on one person's ballot are outliers, I would assume that there is a systematic dishonesty or incompetency. If only 1 or 2 player ratings are outliers and there are no suspicious looking patterns on a ballot, I would leave it alone.

What kind of statistical methods you use to ascertain the two things I mentioned, I don't think is that important. I think you could just as well do it by the "seat of your pants." If a ballot "looks goofy" and you suspect that the person is not being honest or for whatever reason is not competent/diligent, then just throw it out. I know you know not to throw out a ballot just because one or two of a person's ratings are far from everyone else's or from what you consider reasonable. In fact, it is almost a given that you "want" some unusual ratings for all players; again, otherwise you are somewhat defeating the purpose of the study.

In the "Bernie's arm" example, if someone has his arm as average or even excellent, as long as their ballot otherwise looks OK, you definitely want to NOT throw out that ballot. You WANT unusal ratings like this, even though the unusual rating itself "suggests" that the person is not being honest or is incompetent. Ideally, you would want to pre-test or determine independently honesty and competency, but since you can't, you have to do the best you can with the ballots you get. I would try and err on the side of NOT throwing away ballots unless you think you have some INDEPENDENT (of the actual ratings) evidence of dishonesty or incompetency.

I wouldn't worry too much about it...

Posted 6:44 p.m., March 11, 2004 (#24) - tangotiger
  I agree that 1 or 2 outliers should stay in.

In the cases that I cited, one ballot had 11 outliers (that's the ballot at the top of this thread). On another, 36 of the 72 selections were outliers.

In all, I had fewer than 10 ballots where at least 10% of the selections were outliers. I would only consider those, and, as well, I would publish those ballots so that the reader is free to add them back in if he so chooses.

Posted 10:03 a.m., March 18, 2004 (#25) - tangotiger
  I'm thinking about regression towards the mean, based on the number of ballots cast. For example, someone complained about Neifi Perez being one of the lowest-ranked fielders in MLB, when he "surely" is above average. My problem is that only 4 people voted for Neifi, while 58 voted for Yankee fielders.

So, what I need to do is regress the responses, and apply a confidence interval. In this case, if Neifi is, based on 4 ballots, a 25 on a scale of 0 to 100, with 50 as average, and the SD is 20, I would want to convert that into something like:
true talent = 40
1 SD = 15
(numbers for illustration)

When I check out what IMDB.com does with their movies:
http://us.imdb.com/top_250_films

they note the following:

The formula for calculating the top 250 films gives a true Bayesian estimate:

weighted rank (WR) = (v ÷ (v+m)) × R + (m ÷ (v+m)) × C

where:
R = average for the movie (mean) = (Rating)
v = number of votes for the movie = (votes)
m = minimum votes required to be listed in the top 250 (currently 1250)
C = the mean vote across the whole report (currently 6.9)

So, it seems IMDB uses a regression towards the mean equation similar to what I use for baseball (x/(x+PA)). The key value in the IMDB weighting is the "m" value.
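The IMDB formula quoted above, written as a function for illustration; m plays the same role as the regression-toward-the-mean constant in x/(x+PA).

```python
def weighted_rank(R, v, m=1250, C=6.9):
    """R = movie's mean rating, v = its vote count, m = weight on the prior,
    C = overall mean rating."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# e.g. a movie with 100 votes of 10 gets pulled most of the way back toward 7:
print(round(weighted_rank(10, 100, C=7.0), 1))   # 7.2
```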

Any thoughts as to what that m should be in my case?

Posted 12:14 p.m., March 18, 2004 (#26) - J Cross
  I think IMDB should regress to the genre mean instead of the overall mean :)

Posted 6:58 p.m., March 20, 2004 (#27) - Alan Jordan
  I don't understand why you would have a term for the minimum number of votes in the equation. It appears that as m approaches infinity, WR=C, and as m approaches 0, WR=R. Can we get AED to give us a rationale for this equation?

Is it any better than doing a weighted average where we give the overall average a specific weight like 10, 20 or 100 votes?

I don't think I could justify a specific weight, but at least I would understand what I'm doing.

Posted 8:03 p.m., March 20, 2004 (#28) - tangotiger
  The "m" is the specific weight, in this case, 1250. As they use it here, they mean 2 things: 1) the regression towards the mean component, so, the fixed value of 1250, and 2) "oh, and by the way, we only list those movies that are regressed at most 50% towards the mean".

Posted 9:34 p.m., March 20, 2004 (#29) - Alan Jordan
  I get how the weighting works. There are a wide variety of weighting schemes that would be defensible whether they are bayesian or not. What I don't get is how is this bayesian? I didn't think bayesian systems used cut offs for inclusion. If you call something bayesian, then you start off with a set of assumptions and derive an equation that will give you the appropriate answer. I'm just not smart enough to look at this and see how it's bayesian.

I'm not saying that it's inadequate in any way.

Posted 11:12 p.m., March 20, 2004 (#30) - tangotiger
  I didn't think bayesian systems used cut offs for inclusion

I think the cutoff was something that was added and has nothing to do with Bayesian. Just like if I decided to give everyone 5 votes of "3" (average), that I still wouldn't want to include a guy with 1 vote of a "5". The total would be 3.33, but that really tells me nothing about that guy.

Same here. A movie gets 100 votes of "10". But, every movie starts off with 1250 votes of "7". That would make this little-seen movie a "true talent" ranking of 7.2. Heck, my home movies would get a 7.1.

However, making the "m" 1250... fine... I have no idea how they figured that one out, other than the way I would do it with OBA in MLB. 1250 sure looks way too high. But then going out and also making that the cutoff? I guess they thought it would be too weird to show a sample average of 10.0 next to a weighted average of 7.2.

Posted 12:51 a.m., March 21, 2004 (#31) - Alan Jordan
  I think I get it. M isn't the minimum number of votes to get into the list. It's the minimum number of votes among those already in.
If there were a cutoff number of votes, then there could be more or fewer than 250. What if 350 movies have more than 1,250 votes? Do they all go into the top *250*?
I think membership in the top 250 is based on the number of votes, and 1,250 is approximately the number of votes that #250 has. Ranking of the 250 is then based on their weighted average. They decided to give 6.9 the arbitrary (as far as I can see) weight of the vote count of the movie with the fewest votes.

At first glance 1,250 seems way too high, but it has the effect of giving movies with the most positive votes more of a push.

Posted 12:52 p.m., March 21, 2004 (#32) - tangotiger
  Alan,

I had thought that they just drew a line at 1250 votes, and whatever movies fell above that line (say 2,000 of them), those movies qualify. Then, using 1250 again (coincidentally) as the regression towards the mean component, they put out the weighted "true score" list.

However, what you are suggesting is that they first selected the top 250 unweighted movies, that the movie with the fewest votes in that list had 1250, and that they then proceeded from there. I don't think that's right.

Posted 12:37 p.m., March 22, 2004 (#33) - Alan Jordan
  Maybe you're right.

Posted 1:02 p.m., March 22, 2004 (#34) - Wally Moon
  I'm coming in a bit late on this discussion, but I wholeheartedly endorse the advice you're getting from Alan Jordan in particular, but also several others.

I would throw out only ballots that show completely unambiguously that the respondent did not attempt to perform the task of rating. Thus, the case where a player gets identical high, or medium, or low marks in all categories is a potential (but not automatic) candidate for this. Such cases are akin to people doing a "sink test" in a medical lab -- throw the sample (observational data) down the drain and write up the analysis based on other criteria. However, you should be very conservative in excluding cases.

I would not throw out any other cases, but preserve them in the larger data set. Then, later on, in your choice of measures of central tendency (e.g., medians vs. means, "trimmed" means -- leaving off cases with scores > 2 S.D. -- or other alternatives), you should conduct sensitivity tests or tests of robustness to see whether your larger analysis is affected by the choice of measures or cases.

If you can include all except the most egregious "sink test" cases and still get nearly identical overall statistical results, then you will avoid the perception or accusation that you've cooked your data through selection of the data.

This is akin to running your analysis using a variety of alternative assumptions about missing data, or about the quality of data, and so on. You also want to avoid in principle throwing away variance by assuming that the central tendency based on "most" responses is "right" and the extreme cases are "wrong." In the end, that may actually produce effects in your analysis that are opposite what you might suppose. For example, it could lead to an attenuation of correlations across indicators or measures of performance rather than to an improvement of them, because everyone then starts to become "average" or "modal" in their observed behavior across multiple indicators.

So (1) keep all the data (eliminating only unequivocally sink-test cases); (2) use a variety of measures of central tendency; (3) conduct tests of sensitivity or robustness to see how much the inclusion of deviant or extreme cases affects the data analysis; (4) err toward inclusion of seemingly weird cases rather than exclusion and purification, especially if (as is likely) inclusion of these cases won't make much difference anyway; (5) report the results under different assumptions.
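A small sketch of points (2) and (3): a few measures of central tendency side by side, here on the Bernie-arm votes with the junk "5" left in; the trimmed mean drops cases more than 2 S.D. from the mean, per the parenthetical above.

```python
import statistics

def summaries(ratings):
    mean = statistics.mean(ratings)
    sd = statistics.stdev(ratings)
    trimmed = [r for r in ratings if abs(r - mean) <= 2 * sd]
    return {
        "mean": round(mean, 2),
        "median": statistics.median(ratings),
        "trimmed_mean": round(statistics.mean(trimmed), 2),
    }

print(summaries([1] * 30 + [2] * 7 + [3] * 2 + [5]))
# mean ~1.38, median 1.0, trimmed mean ~1.19
```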

Posted 2:06 p.m., March 22, 2004 (#35) - tangotiger
  Here are the ballots I rejected (6), along with the reason. These ballots were rejected on the following basis: more than 2 selections (and 10% of the selections on the ballot) that differed from the mean by more than two levels. Out of nearly 500 ballots, only these 6 were rejected.

FanID Team Player Instincts FirstStep Speed Hands Release Strength Accuracy
210 TOR Cash, Kevin 1 1 1 1 1 1 1
210 TOR Delgado, Carlos 5 5 5 5 5 5 5
210 TOR Wells, Vernon 5 5 5 5 5 5 5
210 TOR Woodward, Chris 1 1 1 1 1 1 1

Delgado was an across-the-board "5". Having across-the-board "1" on Cash and Woodward didn't help either.

259 NYA Coomer, Ron 1 1 1 2 1 2 2
259 NYA Giambi, Jason 1 1 1 1 1 1 1
259 NYA Jeter, Derek 4 5 5 5 5 5 5
259 NYA Johnson, Nick 2 2 2 2 2 2 2
259 NYA Matsui, Hideki 2 2 2 2 2 2 2
259 NYA Mondesi, Raul 1 1 1 1 1 2 2
259 NYA Posada, Jorge 5 4 4 4 4 4 4
259 NYA Soriano, Alfonso 3 3 3 3 3 3 3
259 NYA Spencer, Shane 3 3 3 3 3 3 3
259 NYA Ventura, Robin 3 3 3 3 3 3 3
259 NYA White, Rondell 1 1 1 1 1 1 1
259 NYA Williams, Bernie 5 5 5 5 5 5 5

No one would ever rate Bernie's arm as the best in the league. Williams' best comp, I think, was Rondell White. Considering they both played on the same team, this ballot is just disgusting. Raul Mondesi's arm strength a "1"? The same-score listing for each player just shows what utter nonsense this ballot is.

311 NYA Giambi, Jason 2 2 1 3 2 3 4
311 NYA Jeter, Derek 1 1 1 1 1 1 1
311 NYA Johnson, Nick 4 3 3 4 4 4 4
311 NYA Matsui, Hideki 2 2 1 3 2 2 1
311 NYA Mondesi, Raul 3 4 4 3 2 4 4
311 NYA Posada, Jorge 1 1 1 1 1 1 1
311 NYA Rivera, Juan 4 5 4 3 3 3 4
311 NYA Soriano, Alfonso 2 3 5 2 2 3 2
311 NYA Williams, Bernie 2 1 4 3 2 3 3

This fan doesn't care at all for Derek Jeter. A lot of the other selections are justifiable, but this blatant attempt to bring down Jeter invalidates the whole ballot.

387 BOS Clark, Tony 5 1 1 0 5 5 5
387 BOS Damon, Johnny 1 1 1 1 1 1 1
387 BOS Daubach, Brian 1 1 1 1 1 1 1
387 BOS Garciaparra, Nomar 1 1 1 1 1 1 1
387 BOS Millar, Kevin 1 1 1 1 1 1 1
387 BOS Mirabelli, Doug 1 1 1 1 1 1 1
387 BOS Mueller, Bill 1 1 1 1 1 1 1
387 BOS Nixon, Trot 1 1 1 1 1 1 1
387 BOS Ortiz, David 1 1 1 1 1 1 1
387 BOS Ramirez, Manny 1 1 1 1 1 1 1
387 BOS Varitek, Jason 1 1 1 1 1 1 1

Yup, the worst fielders in the whole league at every position. The worst ballot I've ever seen.

396 CIN Boone, Aaron 5 5 5 4 5 5 3
396 CIN Branyan, Russ 2 1 2 1 0 0 0
396 CIN Casey, Sean 2 2 1 3 2 2 1
396 CIN Castro, Juan 4 4 3 5 4 3 4
396 CIN Dunn, Adam 1 1 4 2 5 5 3
396 CIN Griffey Jr., Ken 5 5 1 5 3 1 2
396 CIN Guillen, Jose 3 3 3 3 5 5 5
396 CIN Kearns, Austin 5 5 4 5 5 5 4
396 CIN Larkin, Barry 1 1 1 1 1 1 1
396 CIN Larson, Brandon 2 2 1 3 3 4 3
396 CIN LaRue, Jason 3 0 0 2 4 5 3
396 CIN Pena, Wily Mo 1 1 3 1 1 4 1

Similar to the Jeter ballot, this person thinks that Larkin is a completely broken-down, decaying SS, which is a far cry from the overall balloting.

407 BOS Damon, Johnny 1 1 1 1 1 1 1
407 BOS Garciaparra, Nomar 1 1 1 1 1 1 5

Again, an attempt to bring down Red Sox players.

418 CHA Crede, Joe 5 0 0 0 0 4 4
418 CHA Lee, Carlos 3 3 4 5 0 0 0
418 CHA Valentin, Jose 1 1 1 1 0 1 3

Again, another fan that can't stand Valentin. The rest of the ballots had Valentin as: 3.9, 4.1, 3.4, 2.8, 3.1, 4.3, 2.1. That's 4 huge discrepancies.

Posted 8:14 p.m., March 22, 2004 (#36) - Heteroskedastic Tendencies
  I think the Reds and White Sox ballots are good. They're, in my opinion, incorrect, but they don't appear to be malicious or lacking in thought. I'd be suspicious of the Reds ballot because Larson has only started about 30 games at 3B in his career.

Perhaps you could e-mail the flagged respondents asking that they verify their selections. This could serve two purposes. First, it's possible the respondent misentered the numbers. I remember accidentally putting "1" as the best and "5" as the worst when going too fast. Switch the numbers on the latter Red Sox ballot and it looks reasonable: by reputation Damon is outstanding (although his arm sucks?) and Nomar has a strong, inaccurate arm.

Also, it could act as a quick-and-dirty test to see if the respondent believes what they're saying. If the same weird results come back, keep them, since more likely than not the respondent has a bias against that player(s).

I thought the point of the study is that the biases even out in the long run. If you throw out certain results, you're saying some biases are OK while others aren't. The ballots are only junk if the person doesn't believe what they're saying, not if they're incorrect. Especially since, as has been mentioned, people who think highly of a player tend to overrate that player.

There seems to be no problem accepting that Mientkiewicz and Pokey Reese are similar fielders. If that really were the case, why isn't Minky playing 2B? There are obviously some problems with the inter-positional ratings.

Still, it does give some interesting results. The Jeter/Abreu comparison makes sense once I think about it. And the fact that the comps for the good fielders at certain positions are other good fielders at that position means that something good is happening, since those players excel due to a combination of skills well suited to that position.

Posted 11:32 p.m., March 22, 2004 (#37) - Hatrack Hines
  Heteroskedastic, in what way does the Jeter/Abreu comparison make sense to you? I'm curious.

Posted 7:12 a.m., March 23, 2004 (#38) - tangotiger
  The ballots are only junk if the person doesn't believe what they're saying, not if they're incorrect.

I had already emailed 2 of the balloters, and they did not reply. Two others did not leave their email addresses. I'm going to email the other 2 today. As far as I'm concerned, these are not observational biases, but an attempt to force an outcome. Observational biases even out. Since I set pretty loose standards to catch junk ballots, I'm sticking with this.
