Tango on Baseball Archives

© Tangotiger

Baseball Graphs - Money and Win Shares (November 28, 2003)

Studes continues...
--posted by TangoTiger at 11:36 AM EDT


Posted 12:22 p.m., November 28, 2003 (#1) - David Smyth
  ---1."Virtually every successful team, even the Yankees, had players who were paid less than $1 million but who delivered significant value. Without these players, you can't buy your way to the top."

So a team essentially needs to develop some players who are good enough to significantly contribute but young enough to have low salaries. A few good, young players. Also, maybe some journeyman veterans who can contribute in a platoon, or solid, veteran lower-salaried middle relievers.

---2. "You will make mistakes with your large contracts. You should expect them."

So a team should be conservative with large contracts, and only offer them for players who are still in their 20s and are "proven" stars, particularly at up-the-middle positions.

---3. "Pitchers are much more likely to be expensive mistakes than everyday players."

So don't sign a pitcher to more than a 3-year contract, and look closely at his past durability and work habits.

Posted 12:44 p.m., November 28, 2003 (#2) - Tangotiger
  The way I calculate "earned" salary is wins above replacement times 1.5 to 2 million$ + 0.3 million$.

So, an average team with 30 players will have about 30-35 wins above replacement, which works out to a total team salary of around 70 million$.
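As a rough check of that arithmetic, here is a minimal sketch. It assumes the 0.3 million$ term applies once per player; the function names and the 32.5-win midpoint are illustrative choices, not part of the formula above.

```python
# Sketch of the "earned" salary rule of thumb; names and defaults are illustrative.
BASE = 300_000  # the 0.3 million$ term, assumed to apply once per player


def earned_salary(war, dollars_per_win):
    """Wins above replacement times a $/win rate, plus the base amount."""
    return war * dollars_per_win + BASE


def team_earned_payroll(team_war, dollars_per_win, roster_size=30):
    """Sum of earned salaries for a roster whose WAR totals team_war."""
    return team_war * dollars_per_win + roster_size * BASE


low = team_earned_payroll(30, 1_500_000)     # 30 wins at 1.5 million$ per win
high = team_earned_payroll(35, 2_000_000)    # 35 wins at 2.0 million$ per win
mid = team_earned_payroll(32.5, 1_750_000)   # midpoint of both ranges
print(f"low ${low / 1e6:.1f}M, mid ${mid / 1e6:.1f}M, high ${high / 1e6:.1f}M")
```

Under these assumptions the range runs from $54M to $79M with a midpoint near $66M, which is in the same ballpark as the 70 million$ figure.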

I'm not sure how different this is from this Win Shares process.

I don't think I yet believe in the extra value paid to the extreme player. This may be a byproduct of being a free agent or not.

Perhaps, studes can generate TWO different lines, one for those players who are 6-yr and over players (i.e., paid the fair market value), and the rest. I'll bet that you'll get 2 lines that will make far more sense and are far different (instead of the current 1 line that averages these 2 lines).

Posted 11:06 p.m., November 28, 2003 (#3) - Alan Jordan
  The regressing of net win shares by salary and then taking the residual seems completely unnecessary. Multiply Win Shares by $300,000 to put them in terms of dollars and then subtract salary and you have net value added to the team. You could also divide win shares by salary or wins above replacement by dollars above replacement. Any of these will give you a valid version of productivity in relation to salary. If I read the article correctly, this was done so that we could evaluate GMs, but since these methods give you productivity in relation to dollar, you're already there before you do the regression.

Also, if you define value as benefit - cost or benefit/cost, you shouldn't be running regressions with value as the dependent variable and cost as the independent variable. Cost is explicitly stated in the dependent variable (value) and this regression will always produce a negative r by definition.

For example, I took this data, created a normally distributed random variable with a mean of 0 and standard deviation of 1,000,000, and then subtracted salary from it. The correlation between this number and salary was -.95.
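The experiment is easy to reproduce in outline. The sketch below uses made-up salaries drawn uniformly between $300K and $10M (an assumption; the actual salary data is not reproduced here), so the exact correlation will differ from the -.95 reported above, but it should come out strongly negative for the same reason.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


random.seed(0)

# Hypothetical salaries: made-up numbers, not the data used in the thread.
salaries = [random.uniform(300_000, 10_000_000) for _ in range(1000)]

# Pure noise with mean 0 and standard deviation 1,000,000.
noise = [random.gauss(0, 1_000_000) for _ in salaries]

# "Value" built as noise minus salary, so salary sits inside the variable.
fake_value = [e - s for e, s in zip(noise, salaries)]

r_noise = pearson_r(noise, salaries)       # close to zero
r_value = pearson_r(fake_value, salaries)  # strongly negative
print(f"noise vs salary: {r_noise:.2f}, noise minus salary vs salary: {r_value:.2f}")
```

The noise itself has essentially no relationship to salary; the negative correlation appears only because salary was subtracted.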

Posted 8:12 a.m., November 29, 2003 (#4) - studes (homepage)
  Alan, thanks. Maybe you can help me think this through. I did the first step you mention (WS*$300K minus salary) to derive "net value added to the team." I noticed that net value decreased as salary increased. Are you saying that this must be true, given the way I defined value? Is there a better way to define value?

On a team level, there is a fairly straightforward negative relationship between value and payroll. The team with the second-most value was Tampa Bay. Fourth was Milwaukee. This didn't seem like the most helpful analysis to me.

So I derived a formula that "best fit" the data. Frankly, I didn't pay any attention to the r or r squared, 'cause I wasn't interested in fit. I wanted a formula that best described the negative slope between salary and value, to better evaluate the GM, as you say. Was it inappropriate to try and capture the negative slope between salary and value, given the way I defined value?

Thanks again for your help.

Posted 1:29 p.m., November 29, 2003 (#5) - Scoriano
  Studes, any luck on Pete providing playoff data for Win Shares?

Posted 3:04 p.m., November 29, 2003 (#6) - studes (e-mail) (homepage)
  Scoriano, sorry. Pete doesn't have time to do that right now, and I'm not going to tackle it myself. There are also some issues deciding how to set certain baselines -- particularly for fielding purposes -- with postseason data. Not as easy as just loading the data.

If you feel up to it, I'd be happy to e-mail you Pete's basic spreadsheets, and you could fill in the data. I think Patriot also has a Win Shares spreadsheet on his site. I'm guessing that Patriot's spreadsheet is more user-friendly.

Posted 3:23 p.m., November 29, 2003 (#7) - Scoriano
  Studes, thanks. I have Patriot's sheets. I may tackle offense only.

Posted 10:06 p.m., November 29, 2003 (#8) - Alan Jordan
  I listed out three ways of doing it, the first you already did and one was Tango's. I don't know if one is any better than the other.

The way you have defined value will by definition give you a positive correlation between win shares and value. At the same time there will also be negative correlation between salary and value. Also if you put winshares and salary into a regression to predict value, you should get an r-square of 1, meaning perfect prediction.

Every correlation implies a linear equation like Y = M*X + B + E.

Y is your dependent variable
M is the slope
X is the independent variable
B is the y-intercept or constant
E is the error (all omitted or misspecified variables)

In this case X is salary and E is winshares (in dollars). If winshares and salary were uncorrelated (this isn't true), then M would be -1 and B would be 0.

As for how to get what you want, I suggest you take the team data that has winshares and salary and fit a logistic (or probit) regression through it. The logistic function has the nice property of the predicted value not going below 0 and not going above 1 (it has to be between 0 and 1). It follows an S like curve that is usually approximately linear between .3 and .7. This is probably what you want to look at because you should get progressively less increase in wins as you spend more.
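To illustrate the shape being described, here is a small sketch of a logistic curve relating payroll to win percentage. The intercept and slope are made-up placeholders rather than fitted values, and `predicted_win_pct` is a hypothetical name. Note that the shrinking gains show up on the upper half of the S curve, which is where these placeholder numbers sit.

```python
import math


def logistic(x):
    """Standard logistic function; its output always lies between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def predicted_win_pct(payroll_millions, intercept=-1.0, slope=0.015):
    # intercept and slope are illustrative placeholders, not estimates
    return logistic(intercept + slope * payroll_millions)


# Gain in predicted win percentage from adding $20M at increasing payroll levels.
gains = [predicted_win_pct(p + 20) - predicted_win_pct(p) for p in (60, 100, 140, 180)]
for start, gain in zip((60, 100, 140, 180), gains):
    print(f"${start}M -> ${start + 20}M: +{gain:.3f}")
```

Each additional $20M buys a smaller increase than the last, and the prediction never leaves the 0 to 1 range no matter how large the payroll.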

I took a look at 2003 data from ESPN and correlated salary to win percentage. The logistic function only slightly outperformed the linear, so I went with the linear. I did a linear regression where win percentage was the dependent variable and salary was the independent variable. I then created a variable called GM, which was simply the residual: win percentage minus predicted win percentage.
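The residual approach can be sketched in a few lines. The four teams and their numbers below are invented purely for illustration (they are not the ESPN data), and `fit_line` is a plain least-squares helper written for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx


# Made-up teams: (payroll in $ millions, win percentage).
teams = {
    "Team A": (50, 0.590),
    "Team B": (150, 0.620),
    "Team C": (80, 0.480),
    "Team D": (60, 0.420),
}

payrolls = [p for p, _ in teams.values()]
win_pcts = [w for _, w in teams.values()]
slope, intercept = fit_line(payrolls, win_pcts)

# GM score = actual win percentage minus the win percentage predicted from payroll.
gm = {name: w - (slope * p + intercept) for name, (p, w) in teams.items()}
ranking = sorted(gm.items(), key=lambda item: item[1], reverse=True)
for name, score in ranking:
    print(f"{name} {score:+.5f}")
```

A positive score means the team won more than its payroll would predict. Because the scores are regression residuals, they sum to zero across teams.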

Oakland came out on top, followed by Toronto; San Francisco was 3rd, Florida 4th, and Atlanta 5th. Detroit outsucked NYMets by an 8% wp margin for the title of salary misallocation champions. Here is the whole table.

Obs  team          gm

  1  Oakland    0.12204
  2  Toronto    0.09450
  3  SanFranc   0.08703
  4  Florida    0.08588
  5  Atlanta    0.07297
  6  Minnesot   0.05571
  7  KansasCi   0.05047
  8  Montreal   0.04649
  9  Seattle    0.04299
 10  Houston    0.03818
 11  Boston     0.03366
 12  ChicagoS   0.02888
 13  Philadel   0.02650
 14  ChicagoC   0.01347
 15  Arizona    0.00585
 16  Pittsbur   0.00340
 17  St.Louis  -0.00113
 18  Milwauke  -0.01708
 19  NYYankee  -0.01852
 20  Anaheim   -0.02613
 21  Colorado  -0.03049
 22  TampaBay  -0.03357
 23  Clevelan  -0.04832
 24  LosAngel  -0.05128
 25  Cincinna  -0.05947
 26  Baltimor  -0.06183
 27  SanDiego  -0.06843
 28  Texas     -0.07044
 29  NYMets    -0.11967
 30  Detroit   -0.20166

I don't fully trust the salary data from ESPN, for a couple of reasons. First, it was opening-day data, so if a player was traded, his salary was attributed completely to his first team, which probably underrepresents the salary of teams like the Yankees. Second, Mike Hampton's 12 mil salary was attributed entirely to Atlanta even though Colorado and Florida were paying most of it this year. So who knows how accurate it is.

Anyway you can take your team winshare and salary data and do the same thing. If you post it on your site, or at fanhome, I'll run it for you.

Posted 6:55 a.m., November 30, 2003 (#9) - studes (homepage)
  Thanks, Alan. I appreciate the comments and work.

I went back to my data and looked at my basic Win Shares vs. Salary regression. The basic formula was WS = 4 + 1.2 * (salary in $ millions). So you would expect, on average, 16 Win Shares for a player who was paid $10M. This doesn't deviate a lot from my formula, so I think my graphs are still appropriate (if not the underlying approach).
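A quick worked version of that formula follows. Combining it with the $300K-per-Win-Share figure from earlier in the thread is an assumption made here for illustration, and the function names are hypothetical.

```python
DOLLARS_PER_WIN_SHARE = 300_000  # the WS*$300K conversion used earlier in the thread


def expected_win_shares(salary_millions):
    """Fitted line from the regression: WS = 4 + 1.2 * salary (in $ millions)."""
    return 4 + 1.2 * salary_millions


def expected_net_value(salary_millions):
    """Expected Win Shares converted to dollars, minus salary, in dollars."""
    return (
        expected_win_shares(salary_millions) * DOLLARS_PER_WIN_SHARE
        - salary_millions * 1_000_000
    )


for salary in (1, 5, 10):
    ws = expected_win_shares(salary)
    net = expected_net_value(salary)
    print(f"${salary}M salary: {ws:.1f} WS expected, net value ${net / 1e6:+.2f}M")
```

Under this combination, each extra $1M of salary is expected to bring 1.2 Win Shares, or $360K of value, so expected net value falls by about $640K per $1M of salary.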

When you say that, by definition, I will get a negative correlation between salary and value, I don't understand why. If you're saying that this is true, given baseball's salary structure, I can understand that. But I posit that there are a lot of industries in which the correlation would be positive. And this is an important insight.

Imagine an industry with a high learning/experience curve, some sort of specialized work. A good example would be baseball without the minor leagues. You might pay an entry worker $40,000 and literally receive no value in return, because that person is learning their craft.

Over time, as they continue to learn their craft, they start to contribute and their "value", as defined in my approach, begins to rise.

But, depending on the rate of increase in salary vs. contribution, that person's "value" might actually increase over time, and over different salary levels.

In a theoretical perfect labor market, with no inherent learning skill issues, value would remain constant across salary levels.

But baseball is different. My articles are about trying to determine how it is different, and what some of the implications are.

Posted 6:58 a.m., November 30, 2003 (#10) - studes (homepage)
  By the way, I posted a link to your comments and analysis from my site. I hope you don't mind.

Posted 10:19 a.m., November 30, 2003 (#11) - Alan Jordan
  "Cost is explicitly stated in the dependent variable (value) and this regression will always produce a negative r by definition."

That's wrong. That should read:

"Cost is explicitly stated in the dependent variable (value) and this regression will always produce a MORE negative r by definition."

A negative r is only guaranteed when benefit and cost are uncorrelated, and that's an extremely abnormal scenario.
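The reason is an identity: the covariance of (benefit - cost) with cost equals the covariance of benefit with cost, minus the variance of cost. The sketch below checks this on two tiny made-up datasets (all numbers are invented for illustration): one where benefit has little relationship to cost, and one where benefit rises three times as fast as cost.

```python
def mean(xs):
    return sum(xs) / len(xs)


def cov(xs, ys):
    """Population covariance; cov(xs, xs) is the population variance."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


cost = [1.0, 2.0, 4.0, 7.0, 10.0]          # made-up costs
benefit_flat = [5.0, 3.0, 6.0, 4.0, 5.0]   # made-up, weakly related to cost
benefit_steep = [3.0 * c for c in cost]    # benefit grows faster than cost

for label, benefit in (("flat", benefit_flat), ("steep", benefit_steep)):
    value = [b - c for b, c in zip(benefit, cost)]
    lhs = cov(value, cost)
    rhs = cov(benefit, cost) - cov(cost, cost)
    print(f"{label}: cov(value, cost) = {lhs:.2f}, cov(benefit, cost) - var(cost) = {rhs:.2f}")
```

In the flat case the result is negative because the variance of cost dominates. In the steep case it is positive, since benefit covaries with cost by more than cost varies. Subtracting cost always pushes the relationship in the negative direction, but it does not always push it below zero.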

Actually your regression of value on salary may have a use as a test of market efficiency. Let me think on this.

Posted 10:54 p.m., November 30, 2003 (#12) - Alan Jordan
  O.k. If the correlation between value (productivity - cost) and cost is negative, then it should mean that people are on average overpaying for productivity. If it's positive, then people are on average underpaying. If it's 0, then people are paying the right price.

Statisticians will cringe at having cost on both sides of the equation, but in this case, I think it's o.k. In general avoid it if you can.

As for linking my quotes to your site, I wasn't sure where to post it anyway. You can quote/link anything I post.