Linear Weights

Linear Weights (LW) is a term used broadly to refer to any linear run estimator, and also to the analytical system of Pete Palmer (see Linear Weights System). The pioneer of Linear Weights was Canadian sabermetrician George Lindsey, but the concept was expanded upon and popularized with Palmer's Batting Runs.

Methods for Generating Linear Weights

While all LW formulas are alike in that they place a first-order coefficient on a number of offensive events, the means by which those coefficients are generated varies.

Empirical Approach

The empirical approach to Linear Weights is closely related to the concept of Run Expectancy. To generate the weights, some sample of data (often all plays in a given league-year, or over the course of several years) is analyzed. The change in run expectancy on each play is calculated as follows:

Change in RE = Final RE - Initial RE + Runs Scored on play

For example, take the case of a grand slam with 2 outs, using values from a run expectancy (RE) table. The initial RE is for the bases loaded, 2 outs state (.815 runs). The final RE is for the new state, which is bases empty, 2 outs (.117 runs). Four runs scored on the play, and thus the value of the play was .117 - .815 + 4 = 3.302 runs.
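
The grand slam calculation above can be written as a short Python sketch. The function name `change_in_re` and its parameter names are illustrative assumptions; the RE values are the ones quoted in the example.

```python
def change_in_re(initial_re, final_re, runs_scored):
    """Run value of one play: final RE minus initial RE plus runs scored."""
    return final_re - initial_re + runs_scored


# Grand slam with the bases loaded and 2 outs, using the RE values quoted above.
grand_slam_value = change_in_re(initial_re=0.815, final_re=0.117, runs_scored=4)
print(round(grand_slam_value, 3))
```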

After doing this process for each play, the results are averaged by event type to produce the Linear Weight values. This procedure yields weights, including a negative value for the out, that estimate runs above average (in other words, the sum of the products of the coefficients and the frequencies of each event will be zero). To estimate the total number of runs scored instead, 1/3 of the expected run total for the inning (equivalent to the bases empty, no outs run expectancy) must be added, for each out on the play, to the coefficients of events which include outs (e.g. a strikeout, caught stealing, or double play).
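
A minimal sketch of the averaging step and of the conversion from runs above average to total runs follows. The function names, the toy list of plays, and the inning run expectancy of 0.48 are all hypothetical and serve only to show the arithmetic.

```python
from collections import defaultdict


def average_run_values(plays):
    """Average the change in RE over all plays of each event type.

    `plays` is an iterable of (event, change_in_re) pairs.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for event, delta in plays:
        totals[event] += delta
        counts[event] += 1
    return {event: totals[event] / counts[event] for event in totals}


def to_absolute_weights(weights, outs_per_event, inning_run_expectancy):
    """Add 1/3 of the inning's expected runs for each out made on the event."""
    per_out = inning_run_expectancy / 3
    return {
        event: value + per_out * outs_per_event.get(event, 0)
        for event, value in weights.items()
    }


# Hypothetical toy sample, not real play-by-play data.
plays = [("HR", 1.0), ("HR", 1.8), ("K", -0.2), ("K", -0.3), ("K", -0.4)]
weights = average_run_values(plays)
absolute = to_absolute_weights(weights, {"K": 1}, inning_run_expectancy=0.48)
```

Only events that record an out are shifted; the home run value is the same in both sets of weights.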

Intrinsic Weights Based on Dynamic Run Estimators

Dynamic run estimators differ from linear run estimators in that they do not place a fixed coefficient on each event, but rather attempt to model the run scoring process. Thus, the value of each event varies based on the frequency of other events.

However, for any given set of input statistics, the intrinsic linear weight that the dynamic estimator places on a given event can be determined. If one trusts that the dynamic estimator being used is a good model, then the linear weights it generates for the inputs could be valuable.

Various approaches can be used to determine the intrinsic weights. The so-called "+1 method" adds one of a given event (e.g. one walk or one double) to the inputs. The difference between the estimator's output with the additional event and its output without it is the linear weight for that event. More precise estimates can be generated by adding smaller increments (for example, 1/100th of a walk), finding the change in estimated runs scored, and dividing by the size of the increment added. The smaller the increment, the more accurate the estimate, because every change to the inputs also slightly changes the context in which the event is being valued.
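
The "+1 method" and its small-increment refinement can be sketched as follows. The `base_runs` function uses one commonly cited simple version of Base Runs; its coefficients are an assumption and do not come from this article, and the team totals are hypothetical. Any other dynamic estimator could be passed in instead.

```python
def base_runs(stats):
    """A simple Base Runs variant (coefficients are an assumption)."""
    h, w, hr, tb, ab = (stats[k] for k in ("H", "W", "HR", "TB", "AB"))
    a = h + w - hr                                       # baserunners
    b = (1.4 * tb - 0.6 * h - 3 * hr + 0.1 * w) * 1.02   # advancement
    c = ab - h                                           # outs
    d = hr                                               # automatic runs
    return a * b / (b + c) + d


def intrinsic_weight(estimator, stats, changes, increment=1.0):
    """Change in estimated runs per unit of an added event.

    `changes` maps each input stat to the amount one event adds to it
    (a walk adds 1 to W; a double adds 1 to H, 2 to TB and 1 to AB).
    """
    bumped = dict(stats)
    for key, amount in changes.items():
        bumped[key] += amount * increment
    return (estimator(bumped) - estimator(stats)) / increment


# Hypothetical team-season totals.
team = {"AB": 5500, "H": 1450, "TB": 2300, "HR": 160, "W": 520}
walk = {"W": 1}
double = {"H": 1, "TB": 2, "AB": 1}

walk_plus_one = intrinsic_weight(base_runs, team, walk)               # "+1 method"
walk_fine = intrinsic_weight(base_runs, team, walk, increment=0.01)   # 1/100th of a walk
double_plus_one = intrinsic_weight(base_runs, team, double)
```

For a team-season of this size the two walk estimates differ only slightly, since one extra event barely changes the context; the gap grows when the inputs are small, such as a single game or an individual player.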

For dynamic estimators that can be written as simple formulas, the formula for the intrinsic weights can be found by partially differentiating the equation with respect to each event. The partial derivative is a calculus concept which finds the change that would be created by adding an infinitesimal amount, eliminating the effect of changing the system.
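
As an illustration, assume the estimator has the general Base Runs form, with A (baserunners), B (advancement), C (outs) and D (automatic runs) each a linear function of the events; this form is an assumption brought in for the example, not something specified in this article. Writing a_x, b_x, c_x and d_x for the coefficients of event x in A, B, C and D, the quotient rule gives the intrinsic weight of x:

```latex
\mathrm{BsR} = \frac{A\,B}{B + C} + D

\frac{\partial\,\mathrm{BsR}}{\partial x}
  = \frac{(a_x B + A\,b_x)(B + C) - A\,B\,(b_x + c_x)}{(B + C)^2} + d_x
```

The weight depends on the current values of A, B and C, which is why the same event receives a different value in a different run environment.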

The intrinsic weights found through the Base Runs estimator, as well as those from Markov models of run scoring, are the ones that are most often used by sabermetricians, since those models work over a wider range of contexts than other dynamic estimators like Runs Created.

Multiple Linear Regression

Linear weights are sometimes generated by running multiple linear regressions to predict runs from the various offensive inputs. This is usually done on team seasonal data, although it could be done on game or inning level data too.
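
A minimal sketch of such a regression is shown below, written in plain Python so that it has no dependencies. It fits a no-intercept least-squares model by solving the normal equations; the function name, the "true" weights, and the team rows are hypothetical. The rows are constructed from one base line with a single event varied at a time, which keeps the toy problem well-posed, and the runs contain no noise, so the fit simply recovers the weights used to generate them.

```python
def fit_linear_weights(rows, runs):
    """Least-squares fit of runs on event counts with no intercept.

    Solves the normal equations (X^T X) w = X^T y by Gauss-Jordan elimination.
    """
    n = len(rows[0])
    # Build the augmented matrix [X^T X | X^T y].
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * y for r, y in zip(rows, runs))]
        for i in range(n)
    ]
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * p for x, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


# Columns: singles, doubles, triples, home runs, walks, outs (hypothetical data).
true_weights = [0.47, 0.78, 1.09, 1.40, 0.33, -0.10]
base = [1000, 280, 30, 150, 500, 4100]
bumps = [50, 20, 10, 30, 40, 60]
teams = [list(base)]
for i, bump in enumerate(bumps):
    row = list(base)
    row[i] += bump  # vary one event at a time
    teams.append(row)

runs = [sum(w * x for w, x in zip(true_weights, row)) for row in teams]
fitted = fit_linear_weights(teams, runs)
```

With real team-season data the runs are noisy and the event counts are correlated with one another, which is where the interpretive problems with regression coefficients arise.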

The drawback of regression is that it is a purely mathematical procedure, and the results do not always conform to what logic or other means (such as empirical linear weights) tell us to be true about baseball. The correlation between an event and runs sometimes does not reflect the impact that it has upon runs. For example, take this regression on team season data from 1954-1999 found in Jim Albert and Jay Bennett's Curve Ball:

R/G = (.49S + .61D + 1.14T + 1.50HR + .33W + .14SB + .73SF)/G

A double is seen to contribute only .61 runs, well below the roughly .8 usually found through other procedures. Additionally, sacrifice flies are valued at .73 runs. This result is not surprising when one considers that a sacrifice fly always results in a run. However, as observers we know that while the sacrifice fly contributes to the run, the more important elements were the events that allowed a runner to reach third base with fewer than two outs. Albert and Bennett explain that sacrifice flies are a "carrier" category, meaning "[They] may carry more information than their literal name implies."

The choice of categories in a regression often affects the coefficients as well. It is not unusual for a regression using Total Bases and Hits to give coefficients for hit types more in line with our expectations than a regression using singles, doubles, triples, and home runs as separate inputs.

Simple Models

As an alternative to working from play-by-play data or deriving intrinsic weights, several methods for producing approximate linear weights for different contexts have been created. These approaches rely on assumptions about the relationships between the values of offensive events that are fairly valid within the normal range of team contexts. While they may not work well when applied to theoretically extreme teams (for example, nine Babe Ruths), they can be used to generate reasonable weights for normal teams and leagues.

Both David Smyth and Tangotiger have published these types of models. Smyth's begins with the premise that each on-base event is worth the average number of runs scored per baserunner (approximated as (R - HR)/(H + W - HR)), and proceeds to use various assumptions to estimate each event's value in terms of advancing baserunners. Combining these two values gives an overall coefficient for each event.
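
The starting premise of Smyth's model, the average number of runs scored per baserunner, is a one-line calculation. The sketch below uses hypothetical league totals and an illustrative function name. It covers only this first component, since the advancement assumptions are not spelled out here.

```python
def runs_per_baserunner(r, h, w, hr):
    """Approximate average runs scored per baserunner: (R - HR) / (H + W - HR)."""
    return (r - hr) / (h + w - hr)


# Hypothetical totals, for illustration only.
on_base_value = runs_per_baserunner(r=750, h=1450, w=520, hr=160)
print(round(on_base_value, 3))
```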

Skeletons and Trial and Error

A skeleton is an equation crafted from the relative weighting of offensive events, which is then multiplied by a constant in order to estimate runs. An example of an estimator developed by a skeleton approach is Paul Johnson's Estimated Runs Produced. Johnson used play-by-play data to determine the average number of bases gained on hits and walks, then experimented to find a value for outs and found a constant (.16) which would bring his equation in line with runs scored.
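
The scaling step of a skeleton approach can be sketched as follows. The relative weights, event totals, and run total are hypothetical and are not Johnson's values; the sketch only shows how a constant is chosen so that the weighted sum lines up with actual runs.

```python
def weighted_sum(relative_weights, totals):
    """Unscaled skeleton: sum of relative weight times event count."""
    return sum(relative_weights[e] * totals[e] for e in relative_weights)


def skeleton_constant(relative_weights, totals, actual_runs):
    """Constant that brings the skeleton in line with actual runs scored."""
    return actual_runs / weighted_sum(relative_weights, totals)


def estimate_runs(relative_weights, totals, constant):
    """Apply the scaled skeleton to a set of event totals."""
    return constant * weighted_sum(relative_weights, totals)


# Hypothetical relative weights and sample totals.
relative_weights = {"S": 3, "D": 5, "T": 7, "HR": 9, "W": 2, "OUT": -0.6}
sample = {"S": 1000, "D": 280, "T": 30, "HR": 150, "W": 500, "OUT": 4100}

constant = skeleton_constant(relative_weights, sample, actual_runs=700)
estimate = estimate_runs(relative_weights, sample, constant)
```

In practice the constant would be fitted over many teams or seasons and then applied to other data, rather than fitted and checked on a single line as it is here.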

In the case of ERP, the logic used to create the skeleton was similar to that of the Run Expectancy approaches described above, since both relied on examination of play-by-play data. However, Jim Furtado's approach in developing Extrapolated Runs was a hybrid. Using ERP as a starting point, he also considered regression results and experimented until he found a formula that he felt made common sense and had superior accuracy when applied to his sample data.

This family of techniques is often criticized because it is seen as forsaking the theoretical soundness of empirical approaches in pursuit of more accurate predictions on sample data.

Examples of Linear Weight Estimators

Below is a list of linear run estimators commonly used by sabermetricians. Note, however, that linear weight methods often do not have unique names, as they are tailored to a specific environment or context. The methods below use long-term average values and are generally not designed for any specific context.