The Markov Chain Model of Baseball


The game of baseball can be described, with remarkable accuracy, by certain probability models. Perhaps the most powerful and elegant of these is the Markov chain model.

Calculations from such models can provide tremendous insight into the game of baseball.

In this Article, the Markov chain model of baseball is described in detail, following first a consideration of the structure of baseball. Applications of this model are then discussed. As an example, the expected number of runs per half inning is calculated for the 2010–2017 seasons. A discussion concludes.

This Article is part of a series exploring the Markov chain model of baseball and its utility.

Note also that the theoretical approach discussed here is implemented in the quantitative computational package _statshacker by statshacker.

The Structure of Baseball

The game of baseball has a discrete, well-defined and relatively “clean” structure.

It is this structure that allows the concise description of games, such as the event data from MLB Gameday and the play-by-play data from Retrosheet.

In order to understand this structure, it is instructive to consider it first at a high level, and then by a more precise description in terms of baseball states and the transitions between them.

High-Level Structure

At a high level, the structure of baseball has a well-defined, deterministic order: baseball consists of games between teams. A game consists of nine innings. Each inning consists of two half innings. Each half inning consists of a sequence of (discrete) events, until three outs have been made.

It is the purpose of simulation to model this progression.
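To make this progression concrete, the following minimal sketch mirrors the structure above. The event model (the function sample_event) is a placeholder of my own, not part of the model developed below, where events will instead be drawn from transition probabilities.

    import random

    def sample_event():
        # Placeholder event model: each event is an out with probability 0.7.
        # A real model would sample from the transition probabilities
        # developed below.
        return "out" if random.random() < 0.7 else "safe"

    def half_inning():
        outs = 0
        while outs < 3:  # a half inning ends when three outs have been made
            if sample_event() == "out":
                outs += 1

    def game():
        for inning in range(9):             # nine innings per game
            for half in ("top", "bottom"):  # two half innings per inning
                half_inning()

    game()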

Baseball States

In order to describe this progression, it is necessary to be able to describe the state of a baseball game as an element of a countable set.

At any given time in a half inning (before or after an event, not while one is in progress), the state can be described by the positions of the baserunner(s) and the number of outs.

There are eight possible baserunner positions (three bases, each of which can be occupied or not, for a total of 2^3 = 8 possibilities); these states are shown in the following figure:

Note their numbering.
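Since the figure is not reproduced here, the following sketch enumerates these configurations. The numbering used (the occupied bases are read as the binary digits of the state number minus one, so that state 1 is bases empty and state 8 is bases loaded) is one common convention, assumed here for illustration rather than taken from the figure.

    # The 2^3 = 8 baserunner configurations, with an assumed numbering:
    # state b encodes the bases (first, second, third) as the bits of b - 1.
    for b in range(1, 9):
        occupied = [base for bit, base in enumerate((1, 2, 3)) if (b - 1) >> bit & 1]
        print(f"baserunner state {b}: bases occupied = {occupied or 'none'}")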

For each of the above states, there are also three possible numbers of outs (0, 1, or 2).

Considered together, there are a total of 8 \times 3 = 24 states.

The following numbering convention allows for simple mathematical organization (below):

  • 0 outs: 1–8
  • 1 out: 9–16
  • 2 outs: 17–24

(i.e., the eight baserunner states are concatenated for each successive number of outs).
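This convention is easy to express programmatically; a minimal sketch (the helper name is mine, and the baserunner numbering is the one assumed above):

    def state_number(b, outs):
        """Map baserunner state b (1-8) and outs (0-2) to the 24-state
        numbering: 1-8 for 0 outs, 9-16 for 1 out, 17-24 for 2 outs."""
        return b + 8 * outs

    assert state_number(1, 0) == 1 and state_number(8, 0) == 8
    assert state_number(1, 1) == 9 and state_number(8, 2) == 24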

In order to completely describe a half inning, though, these states must be augmented with those that end it. These are the three-out states, distinguished by the number of runs scored on the final play:

  • 3^\text{rd} out made on play, no runs scored: 25
  • 3^\text{rd} out made on play, one run scored: 26
  • 3^\text{rd} out made on play, two runs scored: 27
  • 3^\text{rd} out made on play, three runs scored: 28

where the state number has also been indicated.

With these, the countable set of all possible configurations is specified; these are shown in the following figure:

This is known as the state space \mathcal{S}. Note that the states have been labeled in the format (B,o) where B is the set of baserunners (\varnothing denotes the empty set, and * any set) and o is the number of outs; the augmented states (bottom row) are further labeled with the number of runs scored after the “;”.

State Transitions

Progression is then described as a series of state transitions that occur during each half inning.

A state diagram for an example starting from the (3,1) state is shown in the following figure, using a directed graph to picture the state transitions.

The Markov Chain Model

Perhaps the most powerful and elegant model that describes the structure of the game of baseball is the Markov chain model.

A Markov chain is a stochastic model that has the Markov property: the conditional probability distribution of future states of the process depends only upon the present state, not on the sequence of events that preceded it.

The most basic question of validity is: “Does this process (the structure of baseball) have the Markov property?” Of practical importance, this question becomes: “Does the Markov chain model provide a reasonable approximation (the Markov assumption) from which we can develop our analysis?” [1]

Through applications of this model, the answer to the above (practical) question is “yes”.

Mathematical Structure

The Markov chain model is mathematically described by linear algebra.

Consider an experiment of viewing a baseball game at a random time. Based on the above discussion, it is clear that the game must be found in one of 28 states.

The baseball state is therefore a discrete random variable. This can be described by a stochastic vector \textbf{x},

    \[ \textbf{x}^\mathrm{T} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_{28} \end{bmatrix} ~~~ , \]

a row vector (see the note about conventions below), which can be interpreted as a vector of probabilities: the elements x_i are the probabilities of being in state i \in \mathcal{S}. The vector components must therefore sum to one,

    \begin{equation*} \sum_{i = 1}^{28} x_i = 1 ~~~ ; \end{equation*}

and each individual component must have a probability between zero and one,

    \begin{equation*} 0 \le x_i \le 1 ~~~ . \end{equation*}

This is known as the state vector.
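As a concrete check, the following sketch constructs a valid state vector (here, the start of a half inning: bases empty, no outs) and verifies both constraints:

    import numpy as np

    N_STATES = 28

    x = np.zeros(N_STATES)
    x[0] = 1.0  # state 1 (bases empty, 0 outs) with certainty

    # The components must sum to one ...
    assert np.isclose(x.sum(), 1.0)
    # ... and each component must lie between zero and one.
    assert np.all((0.0 <= x) & (x <= 1.0))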

A discrete-time Markov chain is a sequence of random variables \textbf{x}^0, \textbf{x}^1, \ldots with the Markov property,

    \begin{equation*} \operatorname{Pr}(\textbf{x}^{n+1} = \textbf{x}_j | \textbf{x}^n = \textbf{x}_i, \textbf{x}^{n-1} = \textbf{x}_k, \ldots, \textbf{x}^0 = \textbf{x}_l) = \operatorname{Pr}(\textbf{x}^{n+1} = \textbf{x}_j | \textbf{x}^n = \textbf{x}_i) = P_{i,j} \end{equation*}

where i, j, k, l \in \mathcal{S}, and P_{i,j} denotes the one-step transition probability; i.e., the probability that the chain, whenever in state i, moves next into state j.

The square matrix \textbf{P} = [P_{i,j}] is a stochastic matrix, called the one-step transition matrix.

Note that the above follows the common convention to use row vectors of probabilities and right stochastic matrices.

Row i of this matrix describes the probabilities of transition from state i to all other states j. The total transition probability from each state must be one,

(1)   \begin{equation*} \sum_{j = 1}^{28} P_{i,j} = 1 ~~~ . \end{equation*}

Using the numbering convention from above, \textbf{P} can be written in a particularly insightful form,

    \begin{equation*} \textbf{P} = \begin{bmatrix} \textbf{A}_0 & \textbf{B}_0 & \textbf{C}_0 & \textbf{D}_0 \\ \mathbf{0} & \textbf{A}_1 & \textbf{B}_1 & \textbf{E}_1 \\ \mathbf{0} & \mathbf{0} & \textbf{A}_2 & \textbf{F}_2 \\ \mathbf{0} & \mathbf{0} & \mathbf{0} & \textbf{1} \end{bmatrix} ~~~ ; \end{equation*}

i.e., as a block matrix. This form is theoretically useful because the submatrices describe particular types of transitions:

  • the 8{\times}8 \textbf{A} matrices may change the position of base runners, but do not increase the number of outs
  • the 8{\times}8 \textbf{B} matrices may change the position of base runners, and increase the number of outs by one
  • the 8{\times}8 \textbf{C} matrix may change the position of base runners, and increases the number of outs from zero to two
  • the 8{\times}4 \textbf{D}, \textbf{E}, and \textbf{F} matrices increase the number of outs to end the inning
  • the 4{\times}4 \textbf{1} matrix describes the three-out states
  • the blocks of zeroes represent transitions which decrease the number of outs, and thus each have a probability of zero.

The subscripts on each submatrix denote the number of outs prior to the event.

The 8{\times}8 transitions described by \textbf{A}, \textbf{B}, and \textbf{C} are straightforward to understand based on the baserunner states and the associated numbering convention, taking into account the number of outs. Similar remarks apply to the 8{\times}4 transitions described by \textbf{D}, \textbf{E}, and \textbf{F}, based on the numbering convention for the three-out states. The \textbf{1} matrix can be written precisely as

    \begin{equation*} \textbf{1} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{bmatrix} ~~~ . \end{equation*}

Transitions from states 26–28 occur only to state 25, in order to satisfy Eq. (1).

State 25 is the final one of the half inning, from which no further transitions occur. It is therefore an absorbing state; and, technically, the Markov chain is an absorbing Markov chain.
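To illustrate this structure, the following sketch assembles a \textbf{P} with the correct block pattern from placeholder submatrices (randomly generated, purely illustrative values, not estimates from data) and checks Eq. (1) and the absorbing state:

    import numpy as np

    rng = np.random.default_rng(0)

    def stochastic_rows(rows, cols_list):
        """Generate random blocks along one block row of P, jointly
        normalized so that each full row sums to one (Eq. (1))."""
        blocks = [rng.random((rows, c)) for c in cols_list]
        total = sum(b.sum(axis=1) for b in blocks)
        return [b / total[:, None] for b in blocks]

    # Placeholder submatrices with the shapes described above.
    A0, B0, C0, D0 = stochastic_rows(8, [8, 8, 8, 4])
    A1, B1, E1 = stochastic_rows(8, [8, 8, 4])
    A2, F2 = stochastic_rows(8, [8, 4])

    ONE = np.zeros((4, 4))
    ONE[:, 0] = 1.0  # states 25-28 all transition to state 25

    Z, Z48 = np.zeros((8, 8)), np.zeros((4, 8))
    P = np.block([
        [A0,  B0,  C0,  D0],
        [Z,   A1,  B1,  E1],
        [Z,   Z,   A2,  F2],
        [Z48, Z48, Z48, ONE],
    ])

    assert np.allclose(P.sum(axis=1), 1.0)  # Eq. (1)
    assert np.isclose(P[24, 24], 1.0)       # state 25 is absorbing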

Note that this theoretical structure and the associated insight will be explored in future articles.

Transitions between states \textbf{x}^n \rightarrow \textbf{x}^{n+1} can be described by \textbf{P},

    \begin{equation*} \textbf{x}^{n+1} = \textbf{x}^n \textbf{P} \end{equation*}

(hence \textbf{P} as a right stochastic matrix).
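The following self-contained sketch demonstrates this propagation, again with a placeholder \textbf{P} (random values with the block-zero pattern above; illustrative only). As a standard consequence of the absorbing structure, the expected number of events before absorption follows from the fundamental matrix N = (I - Q)^{-1}, where Q is the transient (24-state) part of \textbf{P}:

    import numpy as np

    rng = np.random.default_rng(1)

    # A placeholder P with the block-zero pattern described above.
    mask = np.zeros((28, 28), dtype=bool)
    mask[0:8, 0:28] = True     # 0 outs: all transitions allowed
    mask[8:16, 8:28] = True    # 1 out: the number of outs cannot decrease
    mask[16:24, 16:28] = True  # 2 outs
    mask[24:28, 24] = True     # states 25-28 feed state 25 (the 1 matrix)
    P = np.where(mask, rng.random((28, 28)), 0.0)
    P /= P.sum(axis=1, keepdims=True)

    # Propagate the start-of-inning state vector (bases empty, no outs).
    x = np.zeros(28)
    x[0] = 1.0
    for _ in range(20):
        x = x @ P  # x^{n+1} = x^n P
    print("P(half inning over after 20 events):", x[24])

    # Expected number of events per half inning, from the fundamental
    # matrix (treating states 25-28 collectively as absorbing).
    Q = P[:24, :24]
    N = np.linalg.inv(np.eye(24) - Q)
    print("expected events from state 1:", N.sum(axis=1)[0])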

Applications

There are several applications of the Markov chain model.

Depending on the construction of \textbf{P} and the information used to calculate its transition probabilities, different insights can be obtained.

A few notable examples are first described below. This is followed by an example application.

Notable Examples

Perhaps the earliest such work, by Howard [2], considered the Markov process as a system model and used dynamic programming to determine the optimal time to bunt, with the goal of maximizing expected runs scored.

Later work by Pankin [3] discussed a comprehensive mathematical and statistical approach to lineup determination.

Bukiet, Harold, and Palacios [4] introduced a more general framework. This work considers teams made up of players with different abilities, and is not restricted to a given model of runner advancement. They applied it to lineup determination, run distributions, the expected number of games that a team should win, and trade analysis.

This list of examples is not meant to be complete. The Markov chain model has been used by other researchers; however, despite its power and elegance, its use is rare and usually confined to academic settings.

Example: Markov (Expected) Runs

As an example application, the expected number of runs per half inning for the American League was calculated for several seasons.

Calculating the Transition Matrix

The transition matrix was calculated from Retrosheet play-by-play data, for each year.

Only regular-season games were included in the calculation, including interleague play.

As an approximation, only batting events (i.e., where the batter changes) were included in the calculation. Note that including non-batting events leads to additional complexity, which will be considered in a future article.
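The estimation itself is a counting exercise. A minimal sketch follows, assuming the play-by-play data have already been reduced to (state before, state after) pairs for batting events; the reduction of the raw Retrosheet event files to such pairs is a separate task, not shown, and the helper name is hypothetical.

    import numpy as np

    def estimate_transition_matrix(transitions):
        """Estimate P from observed (state_before, state_after) pairs,
        with states numbered 1-28 as above."""
        counts = np.zeros((28, 28))
        for i, j in transitions:
            counts[i - 1, j - 1] += 1
        P = counts.copy()
        # Rows 25-28 are fixed by theory: they transition only to state 25.
        P[24:28, :] = 0.0
        P[24:28, 24] = 1.0
        # Normalize each row to satisfy Eq. (1); rows never observed are
        # left zero here (they would need a model or a prior).
        row_sums = P.sum(axis=1, keepdims=True)
        return np.divide(P, row_sums, out=np.zeros_like(P), where=row_sums > 0)

    # Hypothetical usage: a strikeout with bases empty and no outs (1 -> 9),
    # a single (1 -> 2), and a strikeout with one out (9 -> 17).
    P = estimate_transition_matrix([(1, 9), (1, 2), (9, 17)])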

Markov Runs

The expected number of runs per half inning was calculated by performing one billion simulations. [Simulations (rather than direct calculation) offer some advantages; a future article will discuss simulations versus calculations.]
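Such a simulation is straightforward given \textbf{P}. The following is a minimal sketch, assuming the numbering convention above (the occupied bases in state s are the bits of (s - 1) mod 8) and a transition matrix P such as the one estimated in the previous sketch; the helper names are mine. The run accounting for batting events follows from conservation: the runners before plus the batter must equal the runners after plus the outs made plus the runs scored.

    import numpy as np

    def runners(s):
        """Number of runners on base in state s (1-24), under the assumed
        numbering where the bits of (s - 1) % 8 encode the occupied bases."""
        return bin((s - 1) % 8).count("1")

    def outs(s):
        """Number of outs in state s (1-24)."""
        return (s - 1) // 8

    def simulate_half_inning(P, rng):
        """Simulate one half inning from P; return the runs scored."""
        s, runs = 1, 0  # start: bases empty, no outs
        while s <= 24:
            s_next = rng.choice(28, p=P[s - 1]) + 1
            if s_next <= 24:
                # Conservation for a batting event:
                # runs = runners before + batter - runners after - outs made.
                runs += runners(s) + 1 - runners(s_next) - (outs(s_next) - outs(s))
            else:
                runs += s_next - 25  # states 25-28 encode runs on the final play
            s = s_next
        return runs

    # Hypothetical usage, with P estimated from play-by-play data as above:
    # rng = np.random.default_rng(0)
    # runs = [simulate_half_inning(P, rng) for _ in range(100_000)]
    # print("expected runs per half inning:", np.mean(runs))

The results are shown in the following table.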

Year    Actual    Markov          Difference (%)
2017    0.526     0.509(1.043)    3.2
2016    0.506     0.491(1.008)    3.0
2015    0.485     0.472(0.993)    2.7
2014    0.464     0.450(0.973)    3.0
2013    0.478     0.463(0.981)    3.1
2012    0.493     0.480(1.003)    2.6
2011    0.493     0.480(1.003)    2.6
2010    0.498     0.483(1.008)    3.0

The “actual” results were approximated as the average number of runs scored and allowed (aggregated) per game divided by the average number of innings pitched (by the American League team) per game; the latter were obtained from Baseball-Reference.com.

The errors in the Markov results (in parentheses, in units of the final digit) are shown to additional decimal places only for clarity; the error in all cases is approximately \pm 1 in the final digit (\pm 0.001 runs).

The Markov (expected) number of runs agrees very well with the actual results for all years. It is, however, slightly underestimated in all cases: on average, by 2.9(2)% (see the discussion below).

The two sets of results are strongly correlated (the results increase or decrease together); Pearson’s correlation coefficient r = 0.998. Note that the correlation coefficient and standard deviation of the difference (see above) give similar information.
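These summary statistics follow directly from the table above; for instance:

    import numpy as np

    # Actual and Markov runs per half inning, 2017 down to 2010 (from the
    # table above).
    actual = np.array([0.526, 0.506, 0.485, 0.464, 0.478, 0.493, 0.493, 0.498])
    markov = np.array([0.509, 0.491, 0.472, 0.450, 0.463, 0.480, 0.480, 0.483])

    diff_pct = 100 * (actual - markov) / actual
    print("mean difference (%):", diff_pct.mean())          # about 2.9
    print("Pearson r:", np.corrcoef(actual, markov)[0, 1])  # about 0.998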

The strong correlation suggests that some stable (similar, season-over-season), average effect is missing. A plausible explanation for the underestimation is therefore that only batting events were used to calculate the transition matrix; this suggests that including non-batting events (such as stolen bases) generally leads to more runs.

Discussion

The Markov chain model of baseball was considered in detail, both qualitatively and mathematically. It was shown that because of the structure of baseball, this model provides a powerful and elegant description of the game. By an example application of this model, it was demonstrated that this description is remarkably accurate.

In future articles, the Markov chain model of baseball will be considered in even more detail; this includes more specific applications of it, and its utility.

References

[1] J. S. Sokol, “An Intuitive Markov Chain Lesson From Baseball,” INFORMS Transactions on Education 5, 47–55 (2004).

[2] R. A. Howard, Dynamic Programming and Markov Processes, 49–54 (MIT Press and Wiley, 1960).

[3] M. D. Pankin, “Finding Better Batting Orders,” SABR XXI (1991).

[4] B. Bukiet, E. R. Harold, and J. L. Palacios, “A Markov Chain Approach to Baseball,” Operations Research 45, 14–23 (1997).


About Author

statshacker is an Assistant Professor of Physics and Astronomy at a well-known state university. His research interests involve the development and application of concepts and techniques from the emerging field of data science to study large data sets. Outside of academic research, he is particularly interested in such data sets that arise in sports and finance. Contact: statshacker@statshacker.com