Park Factors (March 18, 2004)
NC (Seattle): ... At what point do Barry Bonds' outlandish numbers skew the Park Effects of the places he plays? Does Pac Bell rate as the best pitcher's park in the game if you remove Barry's PAs? And since he must rack up 40 PAs or so in Dodger Stadium a year, is Dodger Stadium even tougher on "normal" lefties than it appears?
NC is on the money here. I've been talking about this for years.
Whether it's Bonds, McGwire, Jack Clark, Babe Ruth, or anyone else who is such an outlier from the rest of his team, this is a problem. A park effect (and any other adjustment factor) always assumes a certain talent distribution, so that we don't need to worry about the particular players. But how can you possibly have an accurate LH park adjustment at Yankee Stadium in the 20s or Pac Bell today, when you've got outliers like Ruth and Bonds?
How a park affects a player is DIFFERENT for every single player out there. However, for most players, that difference is very slight. But, that's not always the case.
Busch Stadium does not affect Vince Coleman, Willie McGee, and Jack Clark the same way. Nor should we pretend that it does. If you insist on using a park factor that is the same for all players, you have a responsibility to tell your readers what the confidence interval is for that park factor. (I'm saying this to MGL, BP, STATS, and anyone else that puts out PFs.)
If we assume that a player named Hubie is perfectly average, and will perform perfectly average in all parks, it's quite possible that Mike Piazza's true talent level is +50 runs above Hubie at Dodger Stadium, but +40 runs above Hubie at Fenway. Rey Rey could be -20 runs at Dodger Stadium and -20 runs at Fenway. There's no reason to think that all players are equally impacted at all parks.
It's about leverage. What skills are most leveraged at what parks. This is true for hitting. You can also ask that for fielding: what player traits are most leveraged at what position.
--posted by TangoTiger at 03:18 PM EDT
Posted 4:36 p.m.,
March 18, 2004
(#1) -
Alan Jordan
Is it me or does that link go to another discussion?
Posted 4:44 p.m.,
March 18, 2004
(#2) -
tangotiger
The quote was from Dayn Perry's chat.
Posted 4:53 p.m.,
March 18, 2004
(#3) -
Hatrack Hines
(homepage)
Voros did some work on component park factors (see homepage) which would seem to address part of what Tango's pointing out. If you used component park factors instead of run-based park factors, you'd see different kinds of hitters being affected differently. Thing is, looking at Voros' numbers, it looks like there just isn't enough significant difference in component factors to explain variation in scoring between parks. Or am I misreading something?
Posted 5:43 p.m.,
March 18, 2004
(#4) -
tangotiger
It needs to go one step further: by profile and quality of batter, what's the park factor.
Given a strong, LH, FB, good hitter, what's his PF at Pac Bell? Shea? Fenway?
Posted 6:48 p.m.,
March 18, 2004
(#5) -
MGL
Tango, as you know, I now use park factors that are even more granular than Voros' (or anyone else's) component factors. I use home runs to left, right and center, GB "speed" factors thru the IF, foul out factors, IF hit factors, bunt single factors, etc.
How much do you think this addresses the issues you are talking about? I would think a lot.
Posted 10:26 p.m.,
March 18, 2004
(#6) -
tangotiger
Are your HR factors a function of "long fly balls hit", or a function of HR hit? This makes a HUGE difference. As well, the factors should be based on ratios and not rates.
For example:
Olerud: 10 HR, 100 long flys
Bonds: 40 HR, 100 long flys
If Coors turns a 15/100 hitter into a 20/100 hitter, that's a factor of:
20/80 divided by 15/85 = 1.4
So, Olerud is: 10/90 x 1.4 = ratio of .156 = rate of .135 = 13.5 HR per 100 long flys
Bonds is 40/60 x 1.4 = ratio of .93 = rate of .48 = 48 HR per 100 long flys
So, Olerud's rate was increased by 35%, while Bonds' was increased by 20%.
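The ratio arithmetic above can be sketched in a few lines of Python. This is only an illustration of the odds ratio method described in this comment; the function names are made up for the example. Note that the unrounded factor is about 1.417 rather than 1.4, so the results come out near 13.6 and 48.6 HR per 100 long flys instead of the rounded 13.5 and 48.

```python
def rate_to_odds(rate):
    """Convert a rate (HR per long fly) into odds (HR per non-HR long fly)."""
    return rate / (1.0 - rate)


def odds_to_rate(odds):
    """Convert odds back into a rate."""
    return odds / (1.0 + odds)


def park_odds_factor(neutral_rate, park_rate):
    """Odds ratio factor implied by a park moving neutral_rate to park_rate."""
    return rate_to_odds(park_rate) / rate_to_odds(neutral_rate)


def apply_odds_factor(rate, factor):
    """Apply an odds ratio park factor to a hitter's own rate."""
    return odds_to_rate(rate_to_odds(rate) * factor)


# Coors turns a 15/100 hitter into a 20/100 hitter (numbers from the comment).
coors_factor = park_odds_factor(0.15, 0.20)

olerud_at_coors = apply_odds_factor(0.10, coors_factor)
bonds_at_coors = apply_odds_factor(0.40, coors_factor)

for name, base, adjusted in [("Olerud", 0.10, olerud_at_coors),
                             ("Bonds", 0.40, bonds_at_coors)]:
    pct = 100.0 * (adjusted / base - 1.0)
    print(f"{name}: {100 * base:.1f} -> {100 * adjusted:.1f} HR per 100 long flys "
          f"({pct:.0f}% increase)")
```

The same factor lifts the weaker HR hitter's rate by a larger percentage than the stronger one's, which is the point of working in ratios rather than rates.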
Posted 10:52 p.m.,
March 18, 2004
(#7) -
tangotiger
And the other, larger point is Barry Bonds. What is your component PF for LH at Pac Bell? Now, can you redo your calculations, but this time removing Barry Bonds?
These park factors only work when you have a typical distribution of players. I think it's fairly certain that the LH at Pac Bell don't follow that typical distribution, when 15% of those LH PAs go to the best hitter in the league.
How can you possibly calculate a LH Pac Bell park factor that includes 15% Bonds PAs, and then apply that PF to Bonds himself?
Can you imagine doing a LH HR park factor for the 1920 NYY? What is it... 75% of the LH HR are hit by 1 guy?
Posted 11:03 p.m.,
March 18, 2004
(#8) -
MGL
"How can you possibly calculate a LH Pac Bell park factor that includes 15% Bonds PAs, and then apply that PF to Bonds himself?"
Actually, there is nothing wrong with that at all, given a large enough sample. With a large enough sample, any given player's actual home/road splits ARE his exact PFs.
"Are your HR factors a function of 'long fly balls hit', or a function of HR hit? This makes a HUGE difference."
You make an assumption which it turns out is not true! I've done it both ways and there is very little difference.
We've had this debate a million times before, but once you use many years of data and regress, there simply isn't that much difference between one method or another (for example whether you use the odds ratio or the "rate" method).
The bottom line is that you adjust as best as you can for the dimensions, altitude, prevailing weather, foul territory, lighting, and playing surface of a park and you apply that to the appropriate components for a player and if you are "conservative" you simply end up with a better "number" than when you started...
Posted 11:57 p.m.,
March 18, 2004
(#9) -
Alan Jordan
I did a linear regression with OBA (each row is a separate PA) with independent variables for the park (home team and visiting team for each park, for 58 parameters), batter handedness, and the interaction of batter handedness and park (another 58 parameters). I estimated the park factors and then redid the regression, this time adding in parameters for hitters, pitchers and defensive team. I estimated the park factors from this second regression and compared the two sets of park factors.
The top five changes in park factors are:
CHA Home Left 0.040
ARI Vis Right 0.039
SFN Home Left -0.037
COL Home Left -0.034
BOS Vis Right 0.030
San Fran/Home/Left handed is number three out of 116. The numbers tell you how much the park factors went up after controlling for who was batting and pitching, along with the defensive team. The data was 1999-2002, and there has been no regression to the mean specifically done to the park factors.
Since it's a linear model, the park factors themselves are additive instead of multiplicative or odds ratio based. I've been playing around with using the linear model to approximate the logistic (odds ratio), and it turns out that with baseball data the hypothesis tests and predicted values match pretty damn well, and the linear runs sooooo much quicker.
In summary, San Fran's home lefthanded park factor seems to be inflated by .037 which supports Tango's argument to some degree.
Posted 8:34 a.m.,
March 19, 2004
(#11) -
tangotiger
"but once you use many years of data and regress"
But, since Pac Bell opened, hasn't Bonds continued to make up 15% of all LH PAs there?
You can take Yankee Stadium from 1920 to 1931, and I'd guess that over 50% of LH HR were hit by 1 guy. If, for example, Ruth hit 300 HR at Yankee Stadium and 200 away, and his LH teammates hit 300 HR at Yankee Stadium and 400 away, you would conclude that Yankee Stadium has no HR PF for LH.
But, this is not a representative sample of MLB players, since Ruth makes up 50% of the sample. Furthermore, the more PAs you have of Ruth home and away, the less you need to know about the PF for the non-Ruth players.
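The hypothetical Ruth numbers in this comment can be checked with a short Python sketch. It assumes, for illustration only, that home and away playing time are equal, so that a simple home/away HR ratio can stand in for a park factor; the function and variable names are made up for the example.

```python
def hr_park_factor(home_hr, away_hr):
    """Crude HR park factor: home HR divided by away HR (equal playing time assumed)."""
    return home_hr / away_hr


# Hypothetical totals from the comment: (HR at Yankee Stadium, HR away).
ruth = (300, 200)
lh_teammates = (300, 400)

# Pooling everyone, as a single LH park factor would.
pooled = (ruth[0] + lh_teammates[0], ruth[1] + lh_teammates[1])

ruth_pf = hr_park_factor(*ruth)
teammates_pf = hr_park_factor(*lh_teammates)
pooled_pf = hr_park_factor(*pooled)

print(f"Ruth alone:      {ruth_pf:.2f}")
print(f"LH teammates:    {teammates_pf:.2f}")
print(f"Pooled LH group: {pooled_pf:.2f}")
```

The pooled factor comes out neutral even though neither Ruth nor his teammates are neutral on their own, because one hitter supplies half of the sample.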
Posted 8:35 a.m.,
March 19, 2004
(#12) -
tangotiger
Alan, thanks for that great insight. I'd love to see an article on your findings, and I'd be glad to post it here, if you like.
Posted 9:28 a.m.,
March 19, 2004
(#13) -
tangotiger
Btw, it's a given that using the additive, multiplicative, or odds ratio method will give you similar results, on average. After all:
1 - these adjustment factors, on average, are small to begin with
2 - there are an enormous number of players clustered to the mean
But, when it comes time for Bonds and Pac Bell, the extreme cases, the cases we most care about, we should do it the right way.
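As a rough illustration of that point, here is a Python sketch comparing the three methods on the Coors numbers used earlier in the thread (a park that turns a 15/100 hitter into a 20/100 hitter). The function names and the choice of test rates are assumptions made for the example, not anyone's published method.

```python
def to_odds(rate):
    """Convert a rate into odds."""
    return rate / (1.0 - rate)


def additive(rate, neutral, park):
    """Add the park's raw difference in rate."""
    return rate + (park - neutral)


def multiplicative(rate, neutral, park):
    """Scale the rate by the park's ratio of rates."""
    return rate * (park / neutral)


def odds_ratio(rate, neutral, park):
    """Scale the odds by the park's ratio of odds, then convert back to a rate."""
    odds = to_odds(rate) * to_odds(park) / to_odds(neutral)
    return odds / (1.0 + odds)


NEUTRAL, PARK = 0.15, 0.20  # 15 HR per 100 long flys becomes 20

results = {
    rate: (additive(rate, NEUTRAL, PARK),
           multiplicative(rate, NEUTRAL, PARK),
           odds_ratio(rate, NEUTRAL, PARK))
    for rate in (0.10, 0.15, 0.40)
}

for rate, (add, mult, odds) in results.items():
    print(f"base {rate:.2f}: additive {add:.3f}, "
          f"multiplicative {mult:.3f}, odds ratio {odds:.3f}")
```

For the average hitter the three methods agree exactly, and for the 10/100 hitter they stay within a couple of HR per 100 long flys. For the 40/100 hitter they spread out from .450 to .533, which is where the choice of method starts to matter.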
Posted 5:25 p.m.,
March 19, 2004
(#14) -
MGL
"You can take Yankee Stadium from 1920 to 1931, and I'd guess that over 50% of LH HR were hit by 1 guy. If, for example, Ruth hit 300 HR at Yankee Stadium and 200 away, and his LH teammates hit 300 HR at Yankee Stadium and 400 away, you would conclude that Yankee Stadium has no HR PF for LH."
To some (perhaps a large) extent, that's the result you want! While individual players do indeed have unique park factors, they also don't have unique park factors, such that we gain useful information (sample size) about every player by combining data from all players. Using your argument, we wouldn't want to combine the data from 100 different players, since each of those players has his "own" park factor. If you are going to think of park factors that way, then you might as well not do them at all, other than using a player's own home/road splits and then regressing, or just using a player's road stats (adjusted for HFA) as a park-neutral estimate of his true talent.
Which actually brings up an interesting question. For a player who plays in an arguably "unusual" park and has a long history (say more than 15 years of data), which is a better estimate of his true park-neutral talent: his road stats only, adjusted for HFA, or his total stats with the home stats park adjusted (using some crappy adjustment formula)? Plus, the larger the share of the data a player comprises in a certain park (like HRs, Ruth, and Yankee Stadium), the more the park factor is simply that player's PF, and the more that player's park-adjusted stats are really his road stats only. Remember that if we use a player's own non-regressed splits to adjust his own home stats, we simply get his road stats (and end up completely ignoring his home stats, which can't be right unless we have a huge sample of data for that player)....
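The last point, that adjusting a player's home stats by his own unregressed splits just hands back his road stats, can be seen in a small Python sketch. The rates, the neutral pooled factor of 1.0, and the 0.3 regression weight are all made-up values for illustration, and HFA is ignored.

```python
def park_adjust(home_rate, park_factor):
    """Convert a home rate to a park-neutral rate by dividing out the park factor."""
    return home_rate / park_factor


def regressed_factor(own_factor, pooled_factor, weight_on_own):
    """Blend a player's own split-based factor with the pooled park factor."""
    return weight_on_own * own_factor + (1.0 - weight_on_own) * pooled_factor


# Hypothetical player: HR per PA at home and on the road.
home_rate, road_rate = 0.06, 0.04

# Using only the player's own splits as his park factor.
own_factor = home_rate / road_rate
unregressed_home = park_adjust(home_rate, own_factor)

# Regressing his own factor toward an assumed pooled factor of 1.0.
blended_factor = regressed_factor(own_factor, pooled_factor=1.0, weight_on_own=0.3)
regressed_home = park_adjust(home_rate, blended_factor)

print(f"own factor {own_factor:.2f} -> adjusted home rate {unregressed_home:.4f} "
      f"(road rate is {road_rate:.4f})")
print(f"blended factor {blended_factor:.2f} -> adjusted home rate {regressed_home:.4f}")
```

With the unregressed factor the adjusted home rate is identical to the road rate, so the home data contributes nothing. Any regression toward the pooled factor lets the home data back in, at the cost of leaning on a factor built from other players.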
Posted 9:23 p.m.,
March 19, 2004
(#15) -
Alan Jordan
Thanks for the offer Tango, and I may take you up on it in a week or so.
The problem with "doing it right" is that it takes so long to run. It can take 3 or 4 hours to estimate one model that contains batters and pitchers. On DIPS models where you need to add in the defensive team on top of that, it can take over 12 hours to run. That means you can spend a week doing a single specific hypothesis or set of models.
Applying park factors on a PA level like I'm doing requires knowing the mix of PAs by park. That can be known exactly if you have the data, but can only be approximated for the future. That adds a layer of noise.
There is also the problem of players who have rates of 0 or 1. When these are present in logistic or probit models, they give infinite parameters and contaminate the hypothesis tests. Such players can be dropped or added together with other players with limited PAs ... or you can use some kind of Bayesian model with priors that will take literally weeks to estimate on a home computer. If you had asked me this summer whether using odds ratios was important, I would have said absolutely yes, but I am slowly becoming disabused of that idea. I think that the small differences in talent are mostly obscured by the fog of chance.
Posted 10:35 a.m.,
March 20, 2004
(#16) -
tangotiger
"Using your argument, we wouldn't want to combine the data from 100 different players since each of those players has their 'own' park factor."
No, my argument is that the players in your sample should be representative of the MLB population. The 1920s LH Yanks, and the 2000s Giants are not.
Posted 7:49 p.m.,
March 21, 2004
(#17) -
The Other Kurt
Schindleria asks: Is the solution to use only visitors' stats to calculate park factors?
Maybe one of the SABR-gods lurking around this thread can answer better, but I'll take my stab. Basically, no, that's not the answer. Were you to do that, two things would reduce the benefit you received by using a more diverse batting group. (1) The smaller sample size. While you just reduced the concentration of ABs among certain hitters, you just halved your sample size. And perhaps more importantly, (2) by increasing the diversity of the batting group, you just reduced the diversity of the pitching and defence group. Were you to use only visitors' stats, the vast majority would be against the same 5 pitchers and the exact same defence. So while the numbers would be less affected by home team hitters, they would be greatly MORE affected by home team pitching and defence.
I am not very sophisticated at how these biases weigh against each other, but my guess is the increased bias on pitching and defence would balance the reduction in bias on hitting, and the smaller sample size would make this method not worth using.
Anyone who's done the work want to chime in?
Posted 9:24 p.m.,
March 21, 2004
(#18) -
MGL
The other Kurt, yes, the sample size is an issue, but the pitcher/batter/defense bias is not. Pitchers and fielders have very little influence on park effects as compared to hitters. So yes, if there were one or two players who dominated a component of a particular team's offense, it would NOT be a bad idea to just use the visiting team's data. The problem with using only the visitors' stats is that you would have road stats in a park compared to home stats for the rest of the league. So the ratio or odds ratio or whatever method you used to calculate the park factors would end up having the HFA built into them, which would be extremely problematic. You actually have a similar although not quite so dramatic problem with regular park factors (when you use both teams' stats), and that is if a park has a quirk about it that gives it a greater or lesser than average HFA, that gets built into the regular park factors as well. In fact, you can create 2 park factors for each park: one for the home team and one for the visiting team. That is definitely true for Coors Field. IOW, park factors and HFA are inextricably related...