The game of baseball can be described, with remarkable accuracy, by certain probability models. Perhaps the most powerful and elegant of these is the Markov chain model.
Calculations from such models can provide tremendous insight into the game of baseball.
In this Article, the Markov chain model of baseball is described in detail, following first a consideration of the structure of baseball. Applications of this model are then discussed. As an example, the expected number of runs per game is calculated for the 2010–2017 seasons. A discussion concludes.
This Article is part of the following series exploring the Markov chain model of baseball, and its utility:
- The Markov Chain Model of Baseball
- An Event-Based Framework for the Markov Chain Model of Baseball
- Statistical Analysis of the Stochastic Markov Matrices
Note also that the theoretical approach discussed here is implemented in the quantitative computational package _statshacker by statshacker.
The Structure of Baseball
The game of baseball has a discrete, well-defined and relatively “clean” structure.
At a high level, the structure of baseball has a well-defined, deterministic order: Baseball consists of games between teams. A game consists of nine innings. Each inning consists of two half innings. Each half inning consists of a sequence of (discrete) events, until three outs have been made.
It is the purpose of simulation to model this progression.
At any given time in a half inning of baseball (between events; i.e., before or after any given event, assuming one is not in progress), the state can be described by the positions of the baserunner(s) and the number of outs.
There are eight possible baserunner positions (three bases, each of which can be occupied or not, for a total of 2 × 2 × 2 = 8 possibilities); these states are shown in the following figure:
Note their numbering.
For each of the above states, there are also three possible numbers of outs (0, 1, or 2).
Considered together, there are a total of 8 × 3 = 24 such states.
The following numbering convention allows for simple mathematical organization (below):
- 0 outs: states 1–8
- 1 out: states 9–16
- 2 outs: states 17–24
(i.e., the concatenation of the eight baserunner states for each number of outs).
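This numbering convention can be sketched in code. The following is a minimal illustration, assuming (as one plausible reading of the figure) that the eight baserunner configurations are ordered in binary fashion — bases empty first, then runner on first adding 1, runner on second adding 2, and runner on third adding 4 — with the states then grouped by the number of outs; the function name is hypothetical.

```python
def state_index(first, second, third, outs):
    """Map a (baserunners, outs) configuration to a state number 1-24.

    Assumes a binary ordering of the eight baserunner configurations
    (bases empty = 1, runner on first = 2, ..., bases loaded = 8),
    concatenated for 0, 1, and 2 outs.
    """
    runners = first + 2 * second + 4 * third  # 0..7 configurations
    return 8 * outs + runners + 1
```

For example, bases empty with no outs maps to state 1, and bases loaded with two outs maps to state 24.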
In order to completely describe a half inning, though, these states must be augmented with those that end the half inning. These are the three-out states, labeled by the number of runs scored on the final play,
- out made on play, no runs scored: state 25
- out made on play, one run scored: state 26
- out made on play, two runs scored: state 27
- out made on play, three runs scored: state 28
where the state number has also been indicated.
With these, the countable set of all 28 possible configurations is specified; these are shown in the following figure:
This is known as the state space S. Note that the states have been labeled in the format B: o, where B is the set of occupied bases (∅ denoting the empty set, and in general any subset of {1, 2, 3}) and o is the number of outs; the augmented states (bottom row) are further labeled with the number of runs scored, after the “;”.
Progression is then described as a series of state transitions that occur during each half inning.
The Markov Chain Model
A Markov chain is a stochastic model that has the Markov property: the conditional probability distribution of future states of the process depends only upon the present state, not on the sequence of events that preceded it.
The most basic question of validity is: “Does this process (the structure of baseball) have the Markov property?” Of practical importance, this question becomes: “Does the Markov chain model provide a reasonable approximation (the Markov assumption) from which we can develop our analysis?” 
Through applications of this model, the answer to the above (practical) question is “yes”.
The Markov chain model is mathematically described by linear algebra.
Consider an experiment in which a baseball game is viewed at random. It is clear, based on the above discussion, that the game must be found in one of the 28 states.
The probabilities of being in each state can be collected into a row vector b (see below a note about conventions), where element b_i is the probability of being in state i. Therefore, the vector components must sum to one,

\sum_{i=1}^{28} b_i = 1 ,

and each individual component must be a probability between zero and one,

0 \le b_i \le 1 .

This is known as the state vector.
A discrete-time Markov chain is a sequence of random variables X_1, X_2, X_3, \ldots with the Markov property,

\Pr(X_{n+1} = j \mid X_n = i, X_{n-1} = i_{n-1}, \ldots, X_1 = i_1) = \Pr(X_{n+1} = j \mid X_n = i) = p_{ij} ,

where p_{ij} denotes the one-step transition probability; i.e., the probability that the chain, whenever in state i, moves next into state j.
The square matrix P = (p_{ij}) is a 28 × 28 stochastic matrix, called the one-step transition matrix.
Note that the above follows the common convention to use row vectors of probabilities and right stochastic matrices.
Row i of this matrix describes the probability of transition from state i to all other states j. The total transition probability must be one,

\sum_{j=1}^{28} p_{ij} = 1 .   (1)
Using the numbering convention from above, P can be written in a particularly insightful form,

P = \begin{pmatrix} A_0 & B_0 & C_0 & D_0 \\ 0 & A_1 & B_1 & D_1 \\ 0 & 0 & A_2 & D_2 \\ 0 & 0 & 0 & F \end{pmatrix} ,

i.e., as a block matrix, for theoretical purposes. The submatrices describe the following particular transitions:
- the A_0, A_1, and A_2 matrices (each 8 × 8) may change the position of the baserunners, but do not increase the number of outs
- the B_0 and B_1 matrices may change the position of the baserunners, and increase the number of outs by one
- the C_0 matrix may change the position of the baserunners, and increases the number of outs from zero to two
- the D_0, D_1, and D_2 matrices (each 8 × 4) increase the number of outs to three, ending the inning
- the F matrix (4 × 4) describes the three-out states
- the blocks of zeroes represent transitions that would decrease the number of outs, and thus each have a probability of zero.
The subscripts on each submatrix denote the number of outs prior to the event.
The transitions by A_0, A_1, A_2, B_0, B_1, and C_0 are straightforward to understand based on the baserunner states and the associated numbering convention, taking the number of outs into account. Similar remarks hold for the transitions by D_0, D_1, and D_2, based on the associated numbering convention for the three-out states. The F matrix is precisely written as

F = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{pmatrix} .

Transitions from states 25–28 occur only in order to satisfy Eq. (1), and only to state 25.
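The inning-ending block (written F here) is simple to construct explicitly, since every three-out state must transition, with probability one, to the three-out state with no runs scored (state 25):

```python
import numpy as np

# 4x4 absorbing block: each three-out state (25-28) transitions to
# state 25 (three outs, no runs scored) with probability one,
# so that every row still sums to one.
F = np.zeros((4, 4))
F[:, 0] = 1.0
```

Without these transitions, the rows for states 25–28 would sum to zero and the matrix would not be stochastic.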
Note that this theoretical form and the associated insight will be explored in future articles.
Transitions between states can be described by multiplying the state vector by the transition matrix,

b_{n+1} = b_n P

(hence P being a right stochastic matrix).
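As a minimal sketch of this update rule (the function name is an illustrative choice), the state vector can be propagated forward one event at a time by repeated right-multiplication:

```python
import numpy as np

def evolve(b0, P, n_events):
    """Propagate a row state vector forward: b_n = b_0 P^n."""
    b = np.asarray(b0, dtype=float)
    P = np.asarray(P, dtype=float)
    for _ in range(n_events):
        b = b @ P  # row vector times right stochastic matrix
    return b
```

Because P is right stochastic, the components of b continue to sum to one after each step.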
There are several applications of the Markov chain model.
Depending on the construction of and information that is used to calculate the transition probabilities in , different insight can be calculated.
The (perhaps earliest) work by Howard [2] considered the Markov process as a system model, and used dynamic programming to determine the optimal time to bunt, with the goal of maximizing expected runs scored.
Later work by Pankin [3] discussed a comprehensive mathematical and statistical approach to lineup determination.
Bukiet, Harold, and Palacios [4] introduced a more general framework. This work considers teams made up of players with different abilities, and is not restricted to a given model of runner advancement. They applied it to lineup determination, run distributions, the expected number of games that a team should win, and trade analysis.
This list of examples is not meant to be complete. The Markov chain model has been used by other researchers; however, despite its power and elegance, its use is rare and usually in academic settings.
Example: Markov (Expected) Runs
As an example application, the expected number of runs per game for the American League were calculated for several seasons.
Calculating the Transition Matrix
The transition matrix was calculated from Retrosheet play-by-play data, for each year.
Only regular-season games (including interleague play) were included in the calculation.
As an approximation, only batting events (i.e., where the batter changes) were included in the calculation. Note that including non-batting events leads to additional complexity, which will be considered in a future article.
The expected number of runs per half inning was calculated by performing one billion simulations. [Simulations (rather than direct calculation) offer some advantages; a future article will discuss simulations versus calculations.] The results are shown in the following table.
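A single half-inning simulation of this kind can be sketched as follows. This is a simplified illustration, not the _statshacker implementation: it assumes the binary state numbering sketched earlier, and it scores runs on live transitions using the conservation identity runs = (runners + batter + outs) before the event minus (runners + outs) after it, which holds for batting events; transitions into the augmented states 25–28 carry their own run count.

```python
import numpy as np

rng = np.random.default_rng()

def runners_outs(state):
    """Decode a live state (1-24) into (number of runners, outs).

    Assumes the binary baserunner numbering sketched earlier.
    """
    code = (state - 1) % 8
    outs = (state - 1) // 8
    return bin(code).count("1"), outs

def simulate_half_inning(P):
    """Simulate one half inning from a 28x28 transition matrix P;
    return the number of runs scored."""
    state = 1  # bases empty, no outs
    runs = 0
    while state <= 24:
        nxt = rng.choice(28, p=P[state - 1]) + 1
        if nxt <= 24:
            rb, ob = runners_outs(state)
            ra, oa = runners_outs(nxt)
            # runs scored = (runners + batter + outs) before - after
            runs += (rb + ob + 1) - (ra + oa)
        else:
            runs += nxt - 25  # states 25-28 encode runs on the final play
        state = nxt
    return runs
```

Averaging `simulate_half_inning` over many trials estimates the expected runs per half inning; multiplying by nine gives runs per game.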
The “actual” results were approximated as the average number of runs scored and allowed (aggregated) per game divided by the average number of innings pitched (by the American League team) per game; the latter were obtained from Baseball-Reference.com.
The errors in the Markov results are shown to additional decimal places only for clarity; the precise error in all cases is run.
The Markov (expected) number of runs agrees very well with the actual results for all years. The former are slightly underestimated in all cases, though; on average, by % (see the discussion below).
The two sets of results are strongly correlated (the results increase or decrease together); Pearson’s correlation coefficient . Note that the correlation coefficient and standard deviation of the difference (see above) give similar information.
The strong correlation suggests that some stable (similar, season-over-season) and average effect is missing. A plausible explanation for the underestimation (above) is therefore that only batting events were used to calculate the transition matrix. This suggests that including non-batting events (such as base stealing) generally leads to more runs.
The Markov chain model of baseball was considered in detail, both qualitatively and mathematically. It was shown that because of the structure of baseball, this model provides a powerful and elegant description of the game. By an example application of this model, it was demonstrated that this description is remarkably accurate.
In future articles, the Markov chain model of baseball will be considered in even more detail; this includes more specific applications of it, and its utility.
[1] J. S. Sokol, “An Intuitive Markov Chain Lesson From Baseball,” INFORMS Transactions on Education 5, 47–55 (2004).
[2] R. A. Howard, Dynamic Programming and Markov Processes, 49–54 (MIT Press and Wiley, 1960).
[3] M. D. Pankin, “Finding Better Batting Orders,” SABR XXI (1991).
[4] B. Bukiet, E. R. Harold, and J. L. Palacios, “A Markov Chain Approach to Baseball,” Operations Research 45, 14–23 (1997).