The Relationship Between Runs and Wins

The relationship between runs and wins is one of the most important (if not the most important) in baseball research. It provides the quantitative connection needed to answer fundamental questions. In addition, its development reveals that only a few basic variables provide significant insight into the game of baseball. Despite this, the relationship is often misunderstood, and its utility is often misused.

In this Article, the relationship between runs and wins and its utility are considered in detail.

Note that the historical development of this relationship (including alternative formulations, etc.) is considered in a separate article.

The Relationship Between Runs and Wins

Mathematically, the relationship f between runs and wins can be specified by the following expression:

(1)   \begin{equation*} W = f(G, RS, RA) \end{equation*}

where W is the number of wins, G is the number of games, RS is the number of runs scored by a given team, and RA is the number of runs it allowed to its (collective) opponent.

There is only one constraint that the relationship must satisfy:

  • W/G is bounded with the range [0,1].

There are several properties though that it should have.

The Run–Win Relationship

Linear Models

Basic insight into the run–win relationship can be obtained by noting that for a single game, a win or loss is specified by the following conditions:

    \begin{eqnarray*} RS > RA ~~~ (\text{win}) ~~~ ~ \\ RS < RA ~~~ (\text{loss}) ~~~ . \end{eqnarray*}

Over the course of many games, a natural and plausible assumption is therefore that the (total) run differential RD,

(2)   \begin{equation*} RD = RS - RA ~~~ , \end{equation*}

will be strongly correlated with the number of wins (this is exact for a single game, noting the prior conditions). The validity of this assumption can be verified by calculation.

Normalizing per game, the above assumption implies a strong linear correlation between RD/G and W/G. This means that a linear model (e.g., Ref. [1]) should be quite accurate,

(3)   \begin{equation*} \frac{W}{G} = a \frac{RD}{G} + 0.5 + \epsilon \end{equation*}

where a is the slope of the line (based on the correlation, which we know to be positive), the intercept 0.5 ensures that a team with RD/G = 0 is expected to win half of its games, and \epsilon is the error term.

The fundamental problem with linear models is that they aren’t bounded: neglecting the error term,

    \begin{eqnarray*} RD/G < -0.5/a ~~~ &\Rightarrow& ~~~ W/G < 0 ~~~ ~ \\ RD/G > 0.5/a ~~~ &\Rightarrow& ~~~ W/G > 1 ~~~ . \end{eqnarray*}
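
The loss of boundedness is easy to see numerically. The following is a minimal Python sketch of Eq. (3) without the error term. The slope a = 0.1 is purely an illustrative assumption (it is not a value taken from this Article), and the function name is hypothetical.

```python
def linear_win_pct(rd_per_game, a=0.1):
    """Eq. (3) without the error term: W/G = a * (RD/G) + 0.5.

    The default slope a = 0.1 is an illustrative assumption only.
    """
    return a * rd_per_game + 0.5


a = 0.1
upper = 0.5 / a    # RD/G above this value gives W/G > 1
lower = -0.5 / a   # RD/G below this value gives W/G < 0

for rd in (-6.0, -1.0, 0.0, 1.0, 6.0):
    w = linear_win_pct(rd, a)
    flag = "" if 0.0 <= w <= 1.0 else "  <-- outside [0, 1]"
    print(f"RD/G = {rd:+.1f}  ->  W/G = {w:.3f}{flag}")
```

For realistic run differentials the model stays well inside [0,1]; it only breaks down at the extremes, which is why it can still be quite accurate in practice.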

Pythagorean Expectation

An improved run–win relationship can be derived under two assumptions:

  • The “quality” q of a baseball team is measured by the ratio of RS to RA,

(4)   \begin{equation*} q = \frac{RS}{RA} ~~~ . \end{equation*}

  • Baseball teams win in proportion to their quality.

Note that the first assumption is an extension of that associated above with Eq. (2), and will be considered below (the run environment); the second will also be considered below (the importance of chance).

Consider Team A that plays against a (collective) Team B. Under the above assumptions, the probability that Team A wins is then

(5)   \begin{equation*} \frac{W}{G} = \frac{q_A}{q_A + q_B} \end{equation*}

where q_A and q_B are defined by Eq. (4); note that the latter (from the perspective of Team A) must necessarily be

    \begin{equation*} q_B = \frac{RA}{RS} ~~~ . \end{equation*}

Note that care must be taken (as above) to interpret the latter only as the collective opponent [2]. Only in this case does

    \begin{equation*} \frac{q_A}{q_A + q_B} +\frac{q_B}{q_A + q_B} = 1 ~~~ , \end{equation*}

and the above probabilistic interpretation hold.

Inserting q_A and q_B into Eq. (5) gives

(6)   \begin{equation*} \frac{W}{G} = \frac{RS/RA}{RS/RA + RA/RS} = \frac{RS^2}{RS^2 + RA^2} ~~~ . \end{equation*}

This equation is recognized as the “Pythagorean theorem” developed by Bill James [3], more commonly known as the Pythagorean expectation.

An important improvement relative to the linear model is that the function in Eq. (6) is bounded with the range [0,1], satisfying the constraint.
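
As a quick numerical illustration of Eqs. (4)–(6), the Python sketch below evaluates the quality-ratio form of Eq. (5) and the closed form of Eq. (6) and checks that they agree. The function names and the season totals (800 runs scored, 700 allowed, 162 games) are hypothetical, chosen only for illustration.

```python
def quality(rs, ra):
    """Eq. (4): team quality as the ratio of runs scored to runs allowed."""
    return rs / ra


def win_pct_from_quality(rs, ra):
    """Eq. (5), with q_A = RS/RA and q_B = RA/RS for the collective opponent."""
    q_a = quality(rs, ra)
    q_b = quality(ra, rs)
    return q_a / (q_a + q_b)


def pythagorean(rs, ra):
    """Eq. (6): W/G = RS^2 / (RS^2 + RA^2)."""
    return rs**2 / (rs**2 + ra**2)


# Hypothetical season totals, for illustration only.
rs, ra, games = 800, 700, 162
w_pct = pythagorean(rs, ra)
print(f"W/G = {w_pct:.3f}, expected wins = {games * w_pct:.1f}")
```

For these inputs the expected winning fraction is about 0.566, or roughly 92 wins over 162 games. Note that the quality-ratio form is undefined if either run total is zero, whereas the closed form only fails when both are zero.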

Note that a multivariate Taylor series expansion (to first order) about (RS, RA) results in Eq. (3) [4].
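
To make this connection concrete, here is a sketch of such an expansion. It assumes the expansion point RS = RA = R, where R is a reference run total (for example, a league average); the symbols F, R, and r are introduced here for illustration and are not defined elsewhere in this Article.

```latex
\begin{eqnarray*}
F(RS, RA) &=& \frac{RS^2}{RS^2 + RA^2} , \qquad F(R, R) = \frac{1}{2} \\
\frac{\partial F}{\partial RS} &=& \frac{2 \, RS \, RA^2}{(RS^2 + RA^2)^2} ~~~ \Rightarrow ~~~ \left. \frac{\partial F}{\partial RS} \right|_{(R,R)} = \frac{1}{2R} \\
\frac{\partial F}{\partial RA} &=& -\frac{2 \, RA \, RS^2}{(RS^2 + RA^2)^2} ~~~ \Rightarrow ~~~ \left. \frac{\partial F}{\partial RA} \right|_{(R,R)} = -\frac{1}{2R} \\
\frac{W}{G} &\approx& \frac{1}{2} + \frac{RS - R}{2R} - \frac{RA - R}{2R} = \frac{1}{2} + \frac{RD}{2R}
\end{eqnarray*}
```

Writing R = G r, with r the reference runs per game per team, this has the form of Eq. (3) with slope a = 1/(2r). Under this assumption, a run environment of r = 4.5 would give a = 1/9, i.e., about nine runs of differential per additional win.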

The Importance of Chance

While the second assumption underlying the Pythagorean expectation is plausible, it is not natural. This is because the extent to which it is valid depends on the importance of chance.

There are several ways in which to correct Eq. (6).

The most common approach is to consider a fixed, but (possibly) different exponent,

(7)   \begin{equation*} \frac{W}{G} = \frac{RS^k}{RS^k + RA^k} ~~~ , \end{equation*}

known as the fixed-exponent model. Note that for k = 2, this expression reduces to Eq. (6). A commonly used value [5] is k = 1.83.

Note that the above approach is ad hoc. However, Eq. (7) is found [6] to be theoretically justified, by modeling the number of runs scored and allowed as independent random variables drawn from some distribution (which inherently captures the importance of chance).
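
The effect of the exponent can be seen with a short Python sketch of Eq. (7). The function name and the run totals (800 scored, 700 allowed) are hypothetical and used only for illustration.

```python
def fixed_exponent(rs, ra, k=1.83):
    """Eq. (7): W/G = RS^k / (RS^k + RA^k), with a fixed exponent k."""
    return rs**k / (rs**k + ra**k)


# Hypothetical run totals, for illustration only.
rs, ra = 800, 700
for k in (2.0, 1.83):
    print(f"k = {k:.2f}  ->  W/G = {fixed_exponent(rs, ra, k):.4f}")
```

For a team with RS > RA, lowering the exponent below 2 pulls the expected winning fraction toward 0.5. This is one way to read the exponent: a smaller k attributes a larger role to chance.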

The Run Environment

One thing missing from the Pythagorean expectation is that it does not consider the run environment. This originates with the assumption in Eq. (4), which is insensitive to scaling: multiplying both RS and RA by the same factor leaves q unchanged.

The run environment, however, is directly related to the importance of chance (discussed above). The higher the margin of victory (or defeat) per game, the less likely it is that the result was due to chance. Indeed, considering these margins in a win expectancy model [7] reveals this to be the case (and explains why different sports yield different results).

Under the exponent-correction approach (to chance), it makes sense to consider such a correction there as well, i.e., to let the exponent k depend on the run environment.

Over a wide range of values,

(8)   \begin{equation*} k = RPG^{0.287} \end{equation*}

where RPG is the runs per game (of both teams),

(9)   \begin{equation*} RPG = \frac{RS + RA}{G} ~~~, \end{equation*}

is found to give the best results; this includes the mandatory value of k = 1 at 1 RPG (where every game is decided 1–0, so W/G = RS/(RS + RA) exactly). This formula is known as the Pythagenpat formula.
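
A minimal Python sketch of Eqs. (7)–(9) follows. The function names and the run totals are hypothetical; the second team has the same RS/RA ratio as the first but plays in a doubled run environment, to show that the Pythagenpat exponent (unlike a fixed one) responds to scale.

```python
def runs_per_game(rs, ra, games):
    """Eq. (9): RPG = (RS + RA) / G, counting both teams' runs."""
    return (rs + ra) / games


def pythagenpat_exponent(rs, ra, games, power=0.287):
    """Eq. (8): k = RPG^0.287."""
    return runs_per_game(rs, ra, games) ** power


def pythagenpat(rs, ra, games):
    """Eq. (7) with the exponent taken from Eqs. (8) and (9)."""
    k = pythagenpat_exponent(rs, ra, games)
    return rs**k / (rs**k + ra**k)


# Hypothetical totals: same RS/RA ratio, doubled run environment.
games = 162
for rs, ra in ((800, 700), (1600, 1400)):
    k = pythagenpat_exponent(rs, ra, games)
    print(f"RS={rs}, RA={ra}: k = {k:.3f}, W/G = {pythagenpat(rs, ra, games):.4f}")
```

With 800 runs scored and 700 allowed over 162 games, the exponent comes out near 1.9, between the fixed values 1.83 and 2 discussed above. Doubling both totals raises the exponent, and with it the expected winning fraction, even though q is unchanged.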

The Utility of the Run–Win Relationship

The run–win relationship is often used [5] to predict the expected numbers of wins and losses, leading to the notion of whether a team was “lucky” (or “unlucky”). While there is some utility in this, it detracts from the fundamental utility of this relationship [8].

The goal of a baseball team (over a game, or course of games) is to win games. Questions in baseball research are therefore fundamentally concerned with the importance of a particular quantity in terms of wins.

It is impossible to quantify the importance of any particular quantity directly in terms of wins. It is possible, however, to directly quantify such in terms of runs. The run–win relationship therefore provides the connection needed to answer the fundamental questions.
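
As an illustration of this connection, the Python sketch below converts a change in runs into a change in expected wins using Eq. (7) with the fixed exponent k = 1.83. The function names are hypothetical, and the team (700 runs scored and allowed over 162 games) and the 10-run improvement are assumed values chosen only for illustration.

```python
def expected_wins(rs, ra, games, k=1.83):
    """Expected wins from Eq. (7) with a fixed exponent k."""
    return games * rs**k / (rs**k + ra**k)


def wins_added(rs, ra, games, extra_runs, k=1.83):
    """Change in expected wins when a team scores `extra_runs` more runs."""
    before = expected_wins(rs, ra, games, k)
    after = expected_wins(rs + extra_runs, ra, games, k)
    return after - before


# Hypothetical average team, for illustration only.
delta = wins_added(700, 700, 162, extra_runs=10)
print(f"+10 runs scored -> {delta:+.2f} expected wins")
```

For this hypothetical team, ten additional runs are worth about one expected win. Any quantity that can be valued in runs can be translated into wins in the same way.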

Conclusions

The relationship between runs and wins [Eq. (1)] is simply that, a relationship. Considering its development in detail (as in this Article) shows that it nonetheless provides significant insight into the game of baseball.

This relationship provides insight into:

The utility of this relationship is then that it provides the quantitative connection needed to answer fundamental questions in baseball research. The most accurate form considered here is Eq. (7) with the exponent given by Eqs. (8) and (9).

Given that the run–win relationship is based on several assumptions though, there is likely room for improvement. Indeed, as indicated by recent references (e.g., in this Article, Refs. [2, 4, 6, 7]), it constitutes an active area for current and future research.

References

[1] One of the earliest and perhaps most famous linear models: A. Soolman, Unpublished.

[2] J. Heumann, “An improvement to the baseball statistic ‘Pythagorean Wins’,” Journal of Sports Analytics 2, 49–59 (2016).

[3] B. James, Baseball Abstract (Ballantine Books, 1983).

[4] K. D. Dayaratna and S. J. Miller, “First-Order Approximations of the Pythagorean Formula,” By The Numbers — The Newsletter for the SABR Statistical Analysis Research Committee 22, 15–19 (2012).

[5] Baseball-Reference; Accessed: 2018-03-18.

[6] S. J. Miller, “A Derivation of the Pythagorean Won-Loss Formula in Baseball,” CHANCE 20, 40–48 (2007).

[7] E. H. Kaplan and C. Rich, “Decomposing Pythagoras,” J. Quant. Anal. Sports 13, 141–149 (2017).

[8] FanGraphs; Accessed: 2018-03-18.

About Author

statshacker

statshacker is an Assistant Professor of Physics and Astronomy at a well-known state university. His research interests involve the development and application of concepts and techniques from the emerging field of data science to study large data sets. Outside of academic research, he is particularly interested in such data sets that arise in sports and finance. Contact: statshacker@statshacker.com