Tango on Baseball Archives

© Tangotiger


[Insert component name] Adjustment Factors (December 26, 2003)

I don't like the way park factors, or any adjustment factors, are computed, because the usual method allows a player's own performance to affect the adjustment factor that is applied to him. This is why I like what I did with the Evaluating Catchers article: I compare each catcher to all other catchers, not including the catcher himself.

The following I posted at Fanhome:


Take 1920, for example. You've got 8 parks in the AL, one of which is Yankee Stadium. You want to compute the HR factor for LH hitters.

There were 369 HR hit that year, with 54 by Ruth (let's assume 27 at home). Let's assume that of the other 315 HR, 104 were hit by LH hitters. Let's assume that 1/8 of those, or 13, were hit at Yankee Stadium.

So, we have 40 HR hit by lefties at Yankee Stadium, including 27 by Ruth. There were 118 hit in the other 7 parks (and let's treat them as all being equal parks), with 27 by Ruth, or an average of 17 HR per park (13 without Ruth).

Let's assume that Ruth hits .14 HR per contacted PA (i.e., AB-SO, and let's assume that's the correct opportunity), and that the rest of the league gets .01 HR per contacted PA.

Let's also assume that all RH also get .01 HR per contacted PA, at Yankee Stadium and the other 7 parks, if you need this information.

Question: what is the LH HR factor at Yankee Stadium?

Or, what is the probability of a LH hitter getting a HR at Yankee Stadium? (See? Since we know that there were 40 HR at Yankee Stadium, and an average of 17 at each of the other parks, we can see that the probability will be more than twice as high at Yankee Stadium. But, this is ONLY because we "know" the identity of the one player that plays half his games there. So, Ruth is affecting the probability of a LH hitter getting a HR at Yankee Stadium, simply because of how much of the sample he makes up.)
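To put numbers on that, here is a minimal sketch of the naive calculation, using only the figures assumed in the example. The function name is made up for illustration, and it assumes every park sees the same number of LH opportunities, which the post does not state.

```python
# Figures assumed in the example above.
HOME_LH_HR = 40      # LH home runs at Yankee Stadium, Ruth included
ROAD_LH_HR = 118     # LH home runs in the other 7 parks combined, Ruth included
ROAD_PARKS = 7
RUTH_HOME_HR = 27

def naive_hr_factor(home_hr, road_hr, road_parks):
    """HR in this park divided by average HR per other park.

    Assumes equal LH opportunities in every park (an illustrative simplification).
    """
    return home_hr / (road_hr / road_parks)

factor = naive_hr_factor(HOME_LH_HR, ROAD_LH_HR, ROAD_PARKS)
ruth_share = RUTH_HOME_HR / HOME_LH_HR

print(f"naive LH HR factor: {factor:.2f}")
print(f"Ruth's share of LH HR at Yankee Stadium: {ruth_share:.1%}")
```

Roughly two thirds of the home sample is one hitter, so the naive factor mostly measures Ruth, not the park.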

With [individualized] adjustment factors, we don't have this issue, because we can isolate each hitter, and ask the question: "how has the park affected all OTHER LH hitters?". Then, you compare Ruth to this level.
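A rough sketch of what an individualized factor could look like, using the example's numbers. The function and the dictionary layout are hypothetical, all non-Ruth lefties are lumped into one entry, and equal opportunities per park are assumed; a fuller version would divide by contacted PA.

```python
def leave_one_out_hr_factor(home_hr, road_hr, player, road_parks):
    """HR factor for a park computed from every hitter EXCEPT `player`.

    home_hr / road_hr map hitter name -> HR at the park / HR in the other parks.
    Assumes equal opportunities in every park (illustrative simplification).
    """
    others_home = sum(hr for name, hr in home_hr.items() if name != player)
    others_road = sum(hr for name, hr in road_hr.items() if name != player)
    return others_home / (others_road / road_parks)

# Example's assumed numbers: 13 non-Ruth LH HR at home, 91 in the other 7 parks.
home = {"Ruth": 27, "other LH": 13}
road = {"Ruth": 27, "other LH": 91}

ruth_factor = leave_one_out_hr_factor(home, road, "Ruth", road_parks=7)
print(ruth_factor)  # 1.0
```

With Ruth removed, the other lefties hit 13 at Yankee Stadium and 13 per park elsewhere, so the factor Ruth gets compared against is 1.0 under these assumptions.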

While this won't affect the vast majority of cases, it will affect some oddball ones, and care should be taken to make sure that a player is not being compared to himself. I think the Pinto fielding model has this problem. MGL's UZR might have this problem to some extent as well, especially for a player like Andruw Jones who plays at the same park for half his games over several years.

--posted by TangoTiger at 05:46 PM EDT


Posted 5:47 p.m., December 26, 2003 (#1) - tangotiger (homepage)
  The original fanhome discussion can be found here.

Posted 8:44 p.m., December 26, 2003 (#2) - MGL
  MGL's UZR might have this problem to some extent as well, especially for a player like Andruw Jones who plays at the same park for half his games over several years.

Yes, I thought about that recently as well. I'm not sure it's a problem, though, when you are using home and road data to establish the H/R adjustment factors. Pinto's is a problem because he is just using the home data, which is of course comprised of 50% the home player or home team and 50% lots of players who are presumably around league average. In fact, to estimate a PF, you don't have to use home AND road data. You can use just home data (which is what David is doing), as long as that home data is comprised of an unbiased large group of players. The only reason we use home AND road data in traditional park factors is because we have a biased sample using only the home data (the home team hitters or pitchers or both). I'll have to think about your Ruth example and get back. I'm not sure there is a problem with it.

On a related note: You know how you (Tango) don't like the idea of using park factors at all for "adjusting" (neutralizing or estimating what a player would do in a neutral stadium) a single player's stats, mainly because you don't know how a stadium affects THAT particular player (not to mention the fact that true PF's are hard to estimate)? Assuming that parks affect different players in different ways to SOME extent (remember my little study on how a player's home/road splits regress almost 100%, which means that the unique effect of parks on individual players is SMALL?), let's say that our estimate of the true HR park factor for LHB's in Yankee Stadium is 1.20, and that is based on 20 years of data and then we regressed the sample HR PF appropriately. Now let's say that Ruth has a Yankee Stadium HR PF of 1.50. Doesn't that suggest that Yankee Stadium affects Ruth more than the average LHB, such that when adjusting Ruth's home stats, we might want to take a weighted average of the 1.20 and the 1.50 (maybe 95/5)? Or is Ruth's "extra" advantage already factored into the 1.20, since the 1.20 includes lots of Ruth HR's disproportionately to other players in the league (which is maybe a good reason why we SHOULD partially use a player's own home/road splits to adjust HIS stats)? OTOH, all of the Yankee players are disproportionately represented in that 1.20 (some more than others, depending on how many of the 20 years they played in). We don't want to "weight" Ruth's adjustment factor with all Yankee players. We want to weight it with Ruth's home/road splits. So maybe we do want to use 5 parts 1.50 and 95 parts 1.20 (or whatever combination) for Ruth and some other number than the 1.50 when we park adjust the home stats of other Yankee players.
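As a quick arithmetic check on the 95/5 idea, a minimal sketch follows. The function name is hypothetical, and the 5% weight is just the "maybe" figure floated above, not an estimated value.

```python
def blended_park_factor(park_pf, player_pf, player_weight):
    """Weighted average of the park-wide factor and the player's own home/road factor."""
    return (1.0 - player_weight) * park_pf + player_weight * player_pf

# 95 parts of the 1.20 park estimate, 5 parts of Ruth's own 1.50 split.
ruth_pf = blended_park_factor(park_pf=1.20, player_pf=1.50, player_weight=0.05)
print(f"{ruth_pf:.3f}")  # 1.215
```

At that weight, Ruth's own split moves his adjustment only from 1.20 to about 1.215.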

On a related note, since weather can change the true PF of an outdoor park in any given year (not counting physical park changes) and so can changes in other parks in the league, when we park adjust say a player's or a team's 1999 stats, is it not better to use say 10 years of data, but with the 1999 data more heavily weighted, especially for outdoor parks?
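One way to picture that weighting, as a sketch only: average the single-season sample factors, but count the season being adjusted more heavily. The function name, the 3x weight, and the sample factors below are all made up for illustration; a real version would probably also weight by games played.

```python
def weighted_park_factor(yearly_pf, target_year, target_weight=3.0):
    """Weighted mean of single-season park factors.

    yearly_pf maps season -> sample park factor for that season.
    The target season gets `target_weight`; every other season gets weight 1.
    """
    total = 0.0
    weight_sum = 0.0
    for year, pf in yearly_pf.items():
        w = target_weight if year == target_year else 1.0
        total += w * pf
        weight_sum += w
    return total / weight_sum

# Hypothetical sample factors for three seasons.
sample = {1998: 1.00, 1999: 1.20, 2000: 1.00}
pf_1999 = weighted_park_factor(sample, target_year=1999)
print(f"{pf_1999:.2f}")  # 1.12
```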

I can't believe you actually posted that MLE article! You went from like Sabermetrics 101 to graduate level STATS 581!

BTW, Tango if you come up with any more provocative topics, please keep them to yourself. My head is about to explode! I actually have real work to do and a book to write! Thank God I'm stuck in Rochester, NY, in the winter with nothing to do!

Posted 10:43 p.m., December 26, 2003 (#3) - tangotiger
  I can't work at home with my baby... what else am I supposed to do?

Just to clear it up, MGL did not quite characterize my position on park factors correctly. What I hate is the way people interpret park factors, not that people use park factors to begin with. The park factors, as currently done, are only a first step. Because of the way they are used, people treat them as the final step.

Posted 12:43 a.m., December 27, 2003 (#4) - Virgil
  If I'm not wrong, it wouldn't be so hard to create this separate set of Park Factors v.1.1

I was always interested in comparing Lefty/Righty park factors, although I'm not sure sample size will permit it. It goes with the whole weird dimensions aspect of ballparks, e.g., Righty pull/gap hitters did terribly in the 40s and 50s because of the cavernous left-center power alley, whereas Lefties had a considerable advantage. When factoring in the park factors, righties would be hurt, and lefties would be helped.

I'm curious though, do you two (MGL and Tango) think a) this is feasible, b) sample size would permit it, and c) something like James' five-year window, with 5x emphasis placed on the middle year, would help diminish sample size errors?

Probably a dumb question/post

Posted 9:12 a.m., December 27, 2003 (#5) - tangotiger
  I would use ALL years for a park. For example, the dimensions of Wrigley have not changed, right? Well, go back to the very beginning to establish the Wrigley factor.

However, the types of players HAVE probably changed. As well, the other parks being compared to have changed.

And of course, the weather would apply only to the given year. So, you have to figure out which parks have their climate change the most year-to-year.

There's tons of stuff to consider, which is why I say all this is only a first step.

Posted 8:57 p.m., December 27, 2003 (#6) - MGL
  Tango is correct. You want to use ALL the years you can for a park that hasn't changed. As far as the "other parks" changing in the meantime, you can either: a) live with it, b) try to adjust for it using some sort of "strength of schedule" iterative process, or c) put more weight on the year within which you are adjusting. For weather, you can do a) or c).
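For option b), one possible shape of such an iterative process is sketched below. This is an assumption about what it could look like, not a description of anyone's actual method. It assumes a balanced schedule (each team's road games spread evenly over the other parks), and the function name and the test factors are hypothetical.

```python
def iterate_park_factors(raw_ratios, iterations=100):
    """Refine park factors so each park is compared to the OTHER parks' estimates.

    raw_ratios[i] is park i's home rate divided by its team's road rate.
    Returns factors scaled so that their mean is 1.0.
    """
    n = len(raw_ratios)
    pf = [1.0] * n
    for _ in range(iterations):
        total = sum(pf)
        # A team's road environment is taken as the mean factor of the other n-1 parks.
        new = [r * (total - p) / (n - 1) for r, p in zip(raw_ratios, pf)]
        mean = sum(new) / n
        pf = [x / mean for x in new]
    return pf

# Hypothetical "true" factors, used only to build consistent raw ratios.
true_pf = [1.2, 0.9, 1.0, 0.9]
raw = [t / ((sum(true_pf) - t) / (len(true_pf) - 1)) for t in true_pf]

estimated = iterate_park_factors(raw)
print([round(x, 3) for x in estimated])
```

The raw home/road ratio overstates an extreme park, because that team's road games never include its own park; the iteration pulls the estimates back toward the factors that generated the ratios.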

How much you want to break a park up into its components is personal preference, based on time, energy, availability of data, etc. The more information, the better - ALWAYS, as long as you know how to use it in a reasonable fashion. You can always find a way for MORE data to benefit your model, even if you don't know what the hell you are doing and your sample sizes are very small.

The most coarse kind of adjustment is a "whole park" run factor. If it is a symmetrical park, don't worry TOO MUCH about the R/L thing (there are still wind issues; for example, Wrigley Field benefits balls to left, i.e., RHB's, much more than balls to right, because of the wind). If it is a non-symmetrical park, then perhaps you might want to break that "whole park" run factor into one for RHB's and one for LHB's. It's up to you. Ditto for breaking up run factors into component factors - HR factors, triples factors, etc., or BA, OPS, etc.

Personally, I use component park factors for everything I can think of, for all parts of a park: fly balls to left, right and center, bunt attempts, ground balls hit to the infield (how often they go through or not, depending upon the speed of the surface), foul balls (size of foul territory), etc. If you do that, you have to be careful that you regress those component park factors aggressively and correctly before you apply them to a batter's or a team's stats to estimate what that batter would do in an unknown, neutral park, since the individual component samples are necessarily going to be small. Also, rather than regressing everything to 1.00, since you know lots of things about a particular park, you want to regress sample component PF's to something other than 1.00, depending upon those "things," like altitude, average temp and wind, size of the OF and wall heights, size of foul territory, kind of surface (grass, astroturf, Nex-turf), etc. Kind of tricky but works great!
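A minimal sketch of regressing a sample component factor toward a park-specific prior instead of 1.00. The shrinkage form n / (n + k), the constant k, and every number below are assumptions for illustration, not values from this thread.

```python
def regress_park_factor(sample_pf, sample_size, prior_pf, k):
    """Shrink a sample park factor toward a prior.

    prior_pf is what you would expect from the park's known traits
    (altitude, dimensions, surface, ...). k is a hypothetical constant
    controlling how hard to regress: larger k means more regression.
    """
    weight = sample_size / (sample_size + k)
    return prior_pf + weight * (sample_pf - prior_pf)

# Hypothetical: a sample factor of 1.30 from 200 events, with a prior of 1.10.
regressed = regress_park_factor(sample_pf=1.30, sample_size=200, prior_pf=1.10, k=600)
print(f"{regressed:.2f}")  # 1.15
```

With a small sample the result stays close to the prior; as the sample grows it moves toward the observed factor.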