An Event-Based Framework for the Markov Chain Model of Baseball

The game of baseball can be described, with remarkable accuracy, by certain probability models.

The Markov chain model is perhaps the most powerful and elegant of these. This model has been described in detail in a previous article. Important details are reviewed below.

The model itself, however, does not provide details about the sources of the probabilities. For baseball, these are the events which constitute the game.

In this Article, an event-based framework for the Markov chain model of baseball is presented and discussed in detail.

This Article is outlined as follows. The event-based description of baseball is discussed first. The Markov chain model of baseball is then concisely reviewed. An event-based framework for this model is then described. As an example application, a batting-event model is considered, and some notes on strategies are discussed in this context. Results for the number of runs scored in a half inning of baseball are then presented, following a discussion of the methods used to calculate them. A discussion concludes.

This Article is part of the following series exploring the Markov chain model of baseball, and its utility:

Note also that the theoretical approach discussed here is implemented in the quantitative computational package _statshacker by statshacker.

To cite this Article:

statshacker, “An Event-Based Framework for the Markov Chain Model of Baseball,” statshacker [http://statshacker.com/an-event-based-framework-for-the-markov-chain-model-of-baseball] Accessed: YYYY-MM-DD

statshacker, “Statistical Analysis of the Stochastic Markov Matrices,” statshacker [http://statshacker.com/statistical-analysis-of-the-stochastic-markov-matrices] Accessed: YYYY-MM-DD

The Event-Based Description of Baseball

The discrete, well-defined and relatively “clean” structure of the game of baseball was discussed in a previous article, in terms of baseball states and transitions between them.

This structure can be described in more detail in terms of the cause of these transitions, namely the events which constitute the game.

The reporting of event data for games (e.g., MLB Gameday), and the creation of computer databases of their play-by-play accounts (e.g., Retrosheet), make it possible to describe the game concisely in terms of these events.

Some of these events are discussed below, in the context of a batting-event model of baseball.

The Markov Chain Model of Baseball

The Markov chain model of baseball has been discussed in detail in a previous article.

This model provides a stochastic description of the game of baseball in terms of states and transitions between them.

The latter are described by the transition matrix \textbf{P},

(1)   \begin{equation*} \textbf{P} = \begin{bmatrix} \textbf{A}_0 & \textbf{B}_0 & \textbf{C}_0 & \textbf{D}_0 \\ \mathbf{0} & \textbf{A}_1 & \textbf{B}_1 & \textbf{E}_1 \\ \mathbf{0} & \mathbf{0} & \textbf{A}_2 & \textbf{F}_2 \\ \mathbf{0} & \mathbf{0} & \mathbf{0} & \textbf{1} \end{bmatrix} \end{equation*}

where \textbf{A}, \textbf{B}, \textbf{C}, \textbf{D}, \textbf{E}, \textbf{F}, and \textbf{1} describe particular transitions, and the subscripts denote the number of outs (see here); this notation will be used herein.

Depending on how \textbf{P} is constructed, and on which events are used to calculate its transition probabilities, different insights can be obtained.

Below, this matrix is decomposed in terms of events that further describe the game of baseball.

An Event-Based Framework

Bukiet, Harold, and Palacios (implicitly) introduced [1] an event-based framework for the Markov chain model of baseball.

This framework is first concisely summarized below. A more general one is then presented, which extends upon the underlying idea.

Bukiet, Harold, and Palacios Framework

Bukiet, Harold, and Palacios fundamentally considered [1] the idea of constructing a transition matrix [Eq. (1)] for each player [who (obviously) have different abilities]. Note that this is not considered herein; the following consideration, however, provides the necessary motivation.

They further considered that, if the exact probabilities of a player changing one state of a game to another are not known, then the transition matrix can be filled in using any model (italics for emphasis, for below) of how particular events advance runners, combined with statistics about how often such events occur.  (Specific consideration in Ref. [1] was on batting events, which is expanded upon below.)

This latter “framework” is generalized herein.

More General Framework

The Bukiet, Harold, and Palacios framework [1] can be generalized.

Consider the decomposition of the matrix in Eq. (1) as

(2)   \begin{equation*} \textbf{P} = \sum_{i \in S} p_i \textbf{P}_i \end{equation*}

where the index i in the sum runs over the possible events in the sample space S, p_i is the probability of event i, and \textbf{P}_i is its transition matrix.
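As a minimal sketch of the decomposition in Eq. (2), the following assembles a total transition matrix as a probability-weighted sum of event matrices. The function name, the two-event sample space, and the 2{\times}2 matrices are hypothetical and chosen only for illustration; they are not the baseball matrices discussed herein.

```python
def mix_event_matrices(probs, matrices):
    """Return P = sum_i p_i * P_i over the events i in the sample space."""
    if abs(sum(probs.values()) - 1.0) > 1e-12:
        raise ValueError("event probabilities must sum to 1")
    events = list(probs)
    n = len(matrices[events[0]])
    total = [[0.0] * n for _ in range(n)]
    for event in events:
        for r in range(n):
            for c in range(n):
                total[r][c] += probs[event] * matrices[event][r][c]
    return total


# Hypothetical toy example: two events acting on a two-state chain.
toy_probs = {"a": 0.25, "b": 0.75}
toy_matrices = {
    "a": [[1.0, 0.0], [0.0, 1.0]],  # event "a" leaves the state unchanged
    "b": [[0.0, 1.0], [0.0, 1.0]],  # event "b" always moves to the second state
}
P = mix_event_matrices(toy_probs, toy_matrices)
```

Note that if every row of every event matrix sums to one, and the event probabilities sum to one, then every row of the mixed matrix also sums to one.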

There are several advantages to writing \textbf{P} in this form; two are as follows:

One is the separation of the probability of an event from its outcome (which may also be stochastic).

These (completely) describe different aspects of an event. The former is how likely it is to occur. This changes from season to season, and (certainly) team to team or player to player. The latter describes runner advancement. This may not change as significantly from season to season, and (under an assumption of uniformity) not from team to team or player to player.

Another advantage is that the events can be divided into those for which baserunner advancement is analytical and those for which it is stochastic, the latter with elements that can be analyzed statistically (e.g., from play-by-play data). Note that a “third” category of model baserunner advancement may also be useful to approximate stochastic matrix elements (or entire matrices) when there is insufficient data.

This division makes it straightforward to isolate and study aspects of each event individually.

Note that division of baserunner advancement into two such categories of events has recently been considered [2] in the (related, but different) context of simulation; whereas, using play-by-play data to calculate transition matrices dates back at least to Pankin [3].

This approach can be contrasted to (the more common one of) using a deterministic model for baserunner advancement; perhaps the most common of these is that by D’Esopo and Lefkowitz [4]. This latter approach will be considered again below. The former one will therefore be termed the data-driven model.

These ideas are expanded upon herein, and applied to the Markov chain model of baseball, as described in a previous article and above.

They are made explicit below for a batting-event model of baseball.

Example Application: A (Simplified) Model of Baseball

As an example, a (simplified) model of baseball is considered.

Note that “(simplified)”, in this context, means only that the sample space is incomplete. This is necessary, however, in order to create a model with well-defined states that accurately and completely describe the space. The events included though do account for a significant portion of the effects in the game (see below, and the results).

In future articles, this space will be enlarged, to encompass additional effects.

(Simplified) Batting-Event Model

The batting-event model, as the name implies, is based on consideration of batting events.

This model was considered (though, in a total probability context, and not in terms of events) as an example application in a previous article, and found to provide an accurate number of expected runs in a half inning of baseball.

The sample space for this model is as follows:

(3)   \begin{equation*} S = \{\text{K}, \text{BB}, \text{HBP}, \text{out}, \text{1B}, \text{2B}, \text{3B}, \text{HR}\} \end{equation*}

where \text{K} denotes strikeout, \text{BB} is base on balls, \text{HBP} is hit by pitch, \text{out} is outs other than strikeout (considered in more detail below), \text{1B}, \text{2B}, and \text{3B} are singles, doubles, and triples, respectively, and \text{HR} is home run.

Note that consideration of intentional base on balls is discussed in the context of strategies.

The \text{out} event requires some discussion.

This event is considered an “all other” category, for events including outs; as will be clarified and discussed further in the context of event probabilities. It consists of the following (specific) events:

  • \text{out (generic)}
  • \text{foul error}
  • \text{error}
  • \text{fielder's choice}

The latter three (specific) events are properly included as components, as the batter would have been out, had they not occurred.

Consideration of sacrifice plays, also as part of this event, is discussed in the context of strategies.

Events Not Considered

Non-batting events are not considered in the batting-event model.

This excludes the following events related to base stealing:

  • stolen base
  • defensive indifference
  • caught stealing
  • pickoff;

or the following:

  • other advance / out advancing.

The following pitching-specific events (often related to the latter event above) are also not considered:

  • wild pitch
  • passed ball.

Note that these events may involve the batter-runner; these are mentioned below, in the context of strikeouts.

Finally, the following events that illegally change the course of play are not considered:

  • interference
  • balk.

Event Probabilities

Depending on the source of event probabilities, different information can be calculated. For example, using league averages, the expected number of runs in a (general) league game can be calculated; this is done below.

In any case, event probabilities are calculable from readily-available statistics.

Approximating that the event probabilities are situation independent, the following approach can be used:

Given the finite sample space [Eq. (3)], event probabilities are relative to (modified) plate appearances \text{PA}'; calculated from statistics as

(4)   \begin{equation*} \text{PA}' = \text{AB} + \text{BB} + \text{HBP} + \text{SH} + \text{SF} \end{equation*}

where \text{AB} is at bats, and \text{SH} and \text{SF} are sacrifice hits and flies, respectively. [Modified, because catcher’s interference is not included,

    \begin{equation*} \text{PA}' = \text{PA} - \text{CINT} \end{equation*}

where \text{PA} is plate appearances (not modified) and \text{CINT} catcher’s interference.]

For example,

    \begin{equation*} p_\text{1B} = \frac{\text{H} - \text{2B} - \text{3B} - \text{HR}}{\text{PA}'} \end{equation*}

where \text{H} is hits.

p_\text{out}, however, is inferred as

    \begin{equation*} p_\text{out} = 1 - \sum_{\substack{i \in S, \\ i \neq \text{out}}} p_i ~~~ . \end{equation*}

Hence its consideration as an “all other” category. This leads to a slight list-length type of effect: the probability associated with an out is, in this case, (obviously) increased. By proper consideration of which events are included or not, including in the calculations of probabilities, this effect is minimized.

The validity of this approximation depends on the event. This will be discussed below, in the context of strategies.
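The calculation of event probabilities described above can be sketched as follows. This is only a sketch under stated assumptions: the function name and the dictionary keys are hypothetical (in particular, the strikeout count is assumed to be stored under "SO"), and the counting statistics are invented for illustration; they are not real league totals.

```python
def event_probabilities(stats):
    """Event probabilities relative to modified plate appearances PA' [Eq. (4)]."""
    pa_mod = stats["AB"] + stats["BB"] + stats["HBP"] + stats["SH"] + stats["SF"]
    singles = stats["H"] - stats["2B"] - stats["3B"] - stats["HR"]
    probs = {
        "K": stats["SO"] / pa_mod,
        "BB": stats["BB"] / pa_mod,
        "HBP": stats["HBP"] / pa_mod,
        "1B": singles / pa_mod,
        "2B": stats["2B"] / pa_mod,
        "3B": stats["3B"] / pa_mod,
        "HR": stats["HR"] / pa_mod,
    }
    # The "out" event is the "all other" category: it takes the remaining probability.
    probs["out"] = 1.0 - sum(probs.values())
    return probs


# Hypothetical counting statistics, for illustration only.
example_stats = {
    "AB": 5500, "H": 1400, "2B": 280, "3B": 30, "HR": 180,
    "BB": 500, "HBP": 50, "SO": 1200, "SH": 20, "SF": 40,
}
probs = event_probabilities(example_stats)
```

By construction, the probabilities sum to one, so any event that is not explicitly counted contributes to the probability of the \text{out} event.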

Event Transition-Matrices

In the general framework, event transition-matrices can be divided into the following two primary categories:

  • analytical matrices
  • stochastic matrices;

augmented by the following third one:

  • (deterministic) model matrices.

Analytical Matrices

Baserunner advancement is analytical for the following events:

  • \text{K}
  • \text{BB}
  • \text{HBP}
  • \text{HR}.

The precise forms of these matrices are reported below.

\textbf{P}_\text{K} Matrix

A strikeout (in this consideration — see the following note) increases the number of outs by one, while leaving the baserunner positions unaffected.

Note that this does not consider events where the catcher does not catch the third strike (possibly scored as a wild pitch or passed ball), and the batter-runner reaches first base safely.

The \textbf{P}_\text{K} matrix therefore takes the form

    \begin{equation*} \textbf{P}_\text{K} = \begin{bmatrix} 0 & \textbf{B} & 0 & 0 \\ 0 & 0 & \textbf{B} & 0 \\ 0 & 0 & 0 & \textbf{F} \\ 0 & 0 & 0 & 0 \end{bmatrix} \end{equation*}

where

    \begin{equation*} \textbf{B} = \textbf{I} \end{equation*}

where \textbf{I} is the 8{\times}8 identity matrix, and \textbf{F} is

    \begin{equation*} \textbf{F} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{bmatrix} ~~~ . \end{equation*}
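As a minimal sketch, the strikeout matrix can be assembled block by block. The layout assumed here (eight base states for each of zero, one, and two outs, followed by four three-out states) is inferred from the block dimensions above; the function names are hypothetical.

```python
N_BASE = 8  # base-occupancy states for each number of outs
N_END = 4   # three-out states, matching the 8x4 shape of the block F


def identity(n):
    """Return the n x n identity matrix as nested lists."""
    return [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]


def strikeout_matrix():
    """Assemble the strikeout matrix from the blocks B = I and F."""
    size = 3 * N_BASE + N_END
    matrix = [[0.0] * size for _ in range(size)]
    block_b = identity(N_BASE)
    # Zero or one out: one more out, baserunner positions unchanged.
    for outs in (0, 1):
        for r in range(N_BASE):
            for c in range(N_BASE):
                matrix[outs * N_BASE + r][(outs + 1) * N_BASE + c] = block_b[r][c]
    # Two outs: F sends every base state to the first three-out state.
    for r in range(N_BASE):
        matrix[2 * N_BASE + r][3 * N_BASE] = 1.0
    return matrix


P_K = strikeout_matrix()
```

Each of the first 24 rows contains exactly one nonzero element, as expected for an event with analytical baserunner advancement; the last four rows are zero, as in the matrix above.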

\textbf{P}_\text{BB} Matrix

A base on balls puts the batter-runner on first base; baserunners advance only if forced, and the number of outs does not increase.

The \textbf{P}_\text{BB} matrix therefore takes the form

(5)   \begin{equation*} \textbf{P}_\text{BB} = \begin{bmatrix} \textbf{A} & 0 & 0 & 0 \\ 0 & \textbf{A} & 0 & 0 \\ 0 & 0 & \textbf{A} & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix} \end{equation*}

where

    \begin{equation*} \textbf{A} = \begin{bmatrix} 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix} ~~~ . \end{equation*}
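The block \textbf{A} for a base on balls can be derived from the forced-advance rule rather than written by hand. The sketch below assumes the base-state ordering (empty, first, second, third, first and second, first and third, second and third, loaded); this ordering is not stated explicitly above, but is inferred from the printed matrix. The function names are hypothetical.

```python
# (first, second, third) occupancy flags; the ordering is an assumption
# inferred from the printed block A for a base on balls.
STATES = [
    (0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1),
    (1, 1, 0), (1, 0, 1), (0, 1, 1), (1, 1, 1),
]


def walk(state):
    """Apply a base on balls: baserunners advance only if forced."""
    first, second, third = state
    runs = 0
    if first:
        if second:
            if third:
                runs = 1  # bases loaded: the runner on third is forced home
            third = 1     # the runner on second is forced to third
        second = 1        # the runner on first is forced to second
    first = 1             # the batter-runner takes first base
    return (first, second, third), runs


def walk_block():
    """Build the 8x8 block A with one unit element per row."""
    block = [[0] * len(STATES) for _ in STATES]
    for row, state in enumerate(STATES):
        new_state, _runs = walk(state)
        block[row][STATES.index(new_state)] = 1
    return block


A_BB = walk_block()
```

Under the assumed ordering, the unit elements of the resulting block fall in the same columns as those of the block \textbf{A} printed above.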

\textbf{P}_\text{HBP} Matrix

A hit by pitch has the same outcome as a base on balls.

Note, however, the subtle difference that, in this case, the ball is dead; whereas, when a walk occurs, the ball is live. Since base-stealing events are not considered (and would be handled separately anyway), this does not affect the model.

In any case, since each event transition-matrix in the decomposition of Eq. (2) describes the probabilistic outcome of only that event,

    \begin{equation*} \textbf{P}_\text{HBP} = \textbf{P}_\text{BB} ~~~ . \end{equation*}

\textbf{P}_\text{HR} Matrix

A home run clears the bases, while not increasing the number of outs.

The \textbf{P}_\text{HR} matrix therefore takes the form

    \begin{equation*} \textbf{P}_\text{HR} = \begin{bmatrix} \textbf{A} & 0 & 0 & 0 \\ 0 & \textbf{A} & 0 & 0 \\ 0 & 0 & \textbf{A} & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix} \end{equation*}

where

    \begin{equation*} \textbf{A} = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix} ~~~ . \end{equation*}

Stochastic Matrices

Baserunner advancement is stochastic for the following events:

  • \text{out}
  • \text{1B}
  • \text{2B}
  • \text{3B}.

The precise forms of these matrices (in a different focus of baserunner advancement) will be considered in future articles.

(Deterministic) Model Matrices

The stochastic matrices are in principle possible to analyze statistically.

In practice, however, there can often be insufficient data to calculate all transition matrix elements with sample proportions significantly close to population ones. This is dependent on the type of event (e.g., a triple is the rarest type of hit) and the number of events (samples) available.

In these cases, the stochastic matrix elements (or entire matrices) can be approximated by any deterministic model of baserunner advancement.
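One way this approximation could be applied is sketched below: rows of a stochastic matrix are estimated as sample proportions from observed transition counts, and a row is replaced by the corresponding row of a deterministic model when too few samples are available. The function name, the threshold of 30 samples, and the toy counts are all hypothetical; this is not necessarily the procedure used herein.

```python
def estimate_rows(counts, model, min_samples=30):
    """Row-normalise transition counts, falling back to a model row
    when a starting state has fewer than min_samples observations."""
    estimated = []
    for count_row, model_row in zip(counts, model):
        total = sum(count_row)
        if total >= min_samples:
            estimated.append([c / total for c in count_row])
        else:
            estimated.append(list(model_row))
    return estimated


# Hypothetical transition counts for a three-state toy chain.
toy_counts = [
    [40, 10, 0],  # enough samples: use sample proportions
    [2, 1, 0],    # too few samples: use the model row instead
    [0, 0, 50],
]
toy_model = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
rows = estimate_rows(toy_counts, toy_model)
```

The choice of threshold trades bias from the model against noise in the sample proportions, and would need to be justified for each event.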

Perhaps the most common of these, as mentioned above, is that by D’Esopo and Lefkowitz [4]. Note, though, that this model is somewhat conservative for the events \text{out} and \text{1B}, and neutral for \text{2B} and \text{3B} (see below, though); overall, it is conservative.

Triples, for example, in this model [4], put the batter-runner on third base, and all baserunners score. (The number of outs does not change.)

The \textbf{P}_\text{3B} matrix therefore takes the form

(6)   \begin{equation*} \textbf{P}_\text{3B} = \begin{bmatrix} \textbf{A} & 0 & 0 & 0 \\ 0 & \textbf{A} & 0 & 0 \\ 0 & 0 & \textbf{A} & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix} \end{equation*}

where

    \begin{equation*} \textbf{A} = \begin{bmatrix} 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \end{bmatrix} ~~~ . \end{equation*}

This is quite accurate; though, it does not allow for outs or scoring by the batter-runner.
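The deterministic rule for triples can be sketched in a few lines, together with the number of runs scored from each starting base state. The base-state ordering (empty, first, second, third, first and second, first and third, second and third, loaded) is an assumption inferred from the printed matrices, and the names are hypothetical.

```python
# (first, second, third) occupancy flags, in the assumed ordering.
STATES = [
    (0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1),
    (1, 1, 0), (1, 0, 1), (0, 1, 1), (1, 1, 1),
]


def triple(state):
    """Batter-runner to third base; every baserunner scores."""
    return (0, 0, 1), sum(state)


# 8x8 block with a single unit element per row, and runs scored per row.
A_3B = [
    [1 if STATES.index(triple(state)[0]) == col else 0 for col in range(8)]
    for state in STATES
]
runs_3B = [triple(state)[1] for state in STATES]
```

Every row of this block has its unit element in the column of the state with only third base occupied, and the number of runs scored equals the number of baserunners before the play.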

This model [4] and stochastic matrices will be considered further in a future article.

How this matrix is considered herein is discussed in the Methods section.

Notes on Strategies

Strategies (both offensive and defensive) are situation dependent.

Situation information (e.g., in event probabilities), however, is not considered in this application.

This poses a challenge for consideration of some types of events or plays.

Consider sacrifice plays; and the following two options for their consideration:

By including such, the number of (modified) plate appearances is increased [Eq. (4)]. This increases the probability of an \text{out}, and decreases those of all other events. These probabilities though are assumed to be irrespective of the baseball state. Since these plays occur relatively infrequently, the probability of an \text{out} event is mostly overestimated.

By not including these plays, however, the transition matrix elements do not realistically reflect the progression of a game of baseball.

Related arguments can be made for intentional base on balls. Such events occur most frequently with an excellent hitter at the plate and a significantly worse one on deck. Including such events realistically describes this occurrence on average. Not described, however, is that these situations generally occur with no one on first base.

Note that related arguments could be made for some other types of events also.

Strategies will be considered in more detail in a future article.

Methods

Results were calculated for the American League regular-season games (including interleague play) for the 2010–2017 seasons.

Batting-Event Model

Retrosheet play-by-play data was used to calculate transition matrices for the (simplified) batting-event model.

These were calculated using single-season data (as done in a previous article).

An Event-Based Framework

Event Probabilities

Baseball-Reference.com was used to calculate event probabilities for the league.

Only statistics for the American League teams were included in these calculations (even in the consideration of interleague games), because it is these that are reported (e.g., see here, for the 2017 season). Results are therefore correspondingly compared (below) for runs scored only by these teams.

Transition Matrices

Retrosheet play-by-play data was also used to calculate the stochastic transition matrices.

Baserunner advancement was assumed to be league independent. This allows for more than twice as much data to be used in the calculations.

Stochastic matrices were calculated using an average of single-season ones, over five seasons; and truncated at the upper part of this range, for more recent seasons.
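An element-wise average of single-season matrices can be sketched as follows. The function name and the toy matrices are hypothetical, and the choice of which seasons enter the average (including the truncation for more recent seasons) is left to the caller.

```python
def average_matrices(matrices):
    """Element-wise average of equally shaped matrices (nested lists)."""
    n_rows = len(matrices[0])
    n_cols = len(matrices[0][0])
    return [
        [sum(m[r][c] for m in matrices) / len(matrices) for c in range(n_cols)]
        for r in range(n_rows)
    ]


# Hypothetical single-season matrices for a two-state toy chain.
season_a = [[0.6, 0.4], [0.0, 1.0]]
season_b = [[0.8, 0.2], [0.0, 1.0]]
averaged = average_matrices([season_a, season_b])
```

Note that an unweighted average of row-stochastic matrices is again row stochastic, although it weights each season equally regardless of the number of events observed in it.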

The deterministic \text{3B} matrix [Eq. (6)] was used (entirely).

Simulations

The Markov chain model was simulated one billion times.
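The idea of simulating half innings can be illustrated with a much smaller sketch. It is not the simulation used herein: the sample space is reduced to events with deterministic advancement (a generic out with no baserunner advancement, which is an assumption; a base on balls; a triple in the deterministic model; and a home run), the event probabilities are hypothetical, and the function names are illustrative.

```python
import random


def advance(bases, event):
    """Return (new_bases, runs, outs_added) for one event.

    bases is a (first, second, third) tuple of 0/1 occupancy flags.
    """
    first, second, third = bases
    if event == "out":  # no baserunner advancement (assumption)
        return bases, 0, 1
    if event == "BB":   # baserunners advance only if forced
        runs = 1 if (first and second and third) else 0
        if first and second:
            third = 1
        if first:
            second = 1
        return (1, second, third), runs, 0
    if event == "3B":   # deterministic model: all baserunners score
        return (0, 0, 1), first + second + third, 0
    if event == "HR":   # bases cleared, batter scores
        return (0, 0, 0), first + second + third + 1, 0
    raise ValueError(event)


def simulate_half_inning(probs, rng):
    """Simulate one half inning and return the number of runs scored."""
    events = list(probs)
    weights = [probs[e] for e in events]
    bases, outs, runs = (0, 0, 0), 0, 0
    while outs < 3:
        event = rng.choices(events, weights=weights)[0]
        bases, scored, out_added = advance(bases, event)
        runs += scored
        outs += out_added
    return runs


def mean_runs(probs, n_innings, seed=0):
    """Average runs per half inning over n_innings simulations."""
    rng = random.Random(seed)
    total = sum(simulate_half_inning(probs, rng) for _ in range(n_innings))
    return total / n_innings


# Hypothetical event probabilities, for illustration only.
toy_probs = {"out": 0.70, "BB": 0.20, "3B": 0.02, "HR": 0.08}
estimate = mean_runs(toy_probs, 20000)
```

With a fixed seed the estimate is reproducible; increasing the number of simulated half innings reduces the statistical uncertainty of the mean.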

A future article will discuss simulations versus calculations.

Results

The Markov (expected) number of runs in a half inning is compared to the “actual” results in the following table:

Year    Actual    Markov           Difference (%)
2017    0.529     0.517 (1.045)    2.4
2016    0.508     0.503 (1.029)    0.9
2015    0.491     0.483 (1.004)    1.8
2014    0.466     0.453 (0.973)    2.7
2013    0.483     0.481 (1.006)    0.3
2012    0.4975    0.490 (1.014)    1.5
2011    0.4987    0.489 (1.015)    1.8
2010    0.4992    0.498 (1.030)    0.1

The “actual” results were approximated as the average number of runs scored per game divided by the average number of innings pitched (both only by the American League teams) per game; these results were obtained from Baseball-Reference.com.

The Markov results are shown to three decimal places, as this precision is stable from simulation to simulation. The errors (in parentheses) are shown to the same number of decimal places, only for clarity; the precise error in all cases is \pm 1 run.

The Markov number of runs agrees very well with the actual results, for all seasons. The runs are slightly underestimated in all cases (as expected), though; on average, by only 1.5(9)%.

In addition, the two sets of results are strongly correlated; Pearson’s correlation coefficient r = 0.969.
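Pearson’s correlation coefficient can be calculated directly from the two columns of the table above, as in the following sketch (the function name is hypothetical). Because the tabulated values are rounded, the result is not expected to reproduce the quoted coefficient exactly.

```python
import math

# Values as tabulated above (2017 down to 2010).
actual = [0.529, 0.508, 0.491, 0.466, 0.483, 0.4975, 0.4987, 0.4992]
markov = [0.517, 0.503, 0.483, 0.453, 0.481, 0.490, 0.489, 0.498]


def pearson_r(x, y):
    """Pearson's correlation coefficient of two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


r = pearson_r(actual, markov)
```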

Difficult results to get correct are the subtle differences between seasons 2010–2012. The actual results are nearly identical; and, so are each shown in the above table to one additional decimal place. The Markov results, however, show two problems: (i) they do not predict the relative difference between these seasons well (compare the difference from 2010 to either of the other two seasons); moreover, (ii) the wrong ordering is predicted for seasons 2011 and 2012 [more runs are (incorrectly) predicted for the latter]. Improvements to these problems will be discussed in a future article.

Comparing these results to the (basic) Markov chain model of baseball, they are seen to be 1.4% closer (on average). It is tempting to conclude that these results are therefore (much) better. Consideration must be given to the meaning of “better”, however. Any plausible explanations attributing this to (only) the closer number of runs ultimately seem to be unjustified.

The (basic) model is “complete”, in that all batting events are precisely accounted for (through transition probabilities). In addition, since these are calculated for only the season of interest, the probabilities are modeled precisely (for that season). Indeed, these are reflected by the significantly high correlation coefficient.

Therefore, the basic model represents a best-case scenario. It achieves this, however, at the expense of overfitting to the data; this is clear from the fact that any transition matrix calculated for a season applies only to that season.

With this consideration, a plausible explanation for the much lower correlation (similarly, higher standard deviation), in this case, can be made in terms of the stochastic nature of the game of baseball.

Consider that, in this model, the events which constitute the game of baseball were complete, and precisely resolved. (The resolution of events in the transition matrices will be discussed in a future article.) Due to the stochasticity, however, a finite number of events over a single season may not follow this resolution precisely. This would therefore lead to a lower correlation between the model and actual data.

This argument is based on the assumption of a complete, and precisely resolved model.

For any (actual) model, there will also be a contribution to the lower correlation from the lack of either one of these properties. Considering a “baseline” model (e.g., that in this Article), though, it is expected that improvements will increase this aspect of the correlation with the actual data.

Discussion

An event-based framework for the Markov chain model of baseball was presented and discussed in detail. The fundamental idea is the decomposition of the (total) transition matrix [Eq. (1)] as a sum over individual events [Eq. (2)] of products of event probabilities and event transition-matrices.

Some advantages of this approach were already discussed above.

Of particular focus was that event transition-matrices are categorized into analytical or stochastic. The former were presented; and the latter were discussed to be calculable, and augmented (if need be) by a model for baserunner advancement.

A statistical analysis of the stochastic matrices will be considered in a future article.

There is much insight that could be obtained from the presented framework and application.

There is also much room for improvement. For the basic framework, this includes state-based event probabilities, as discussed in the context of strategies. Going beyond the batting-event model will require including additional effects (see above).

All of this (and more) will be considered in future articles.

References

[1] B. Bukiet, E. R. Harold, and J. L. Palacios, “A Markov Chain Approach to Baseball,” Operations Research 45, 14–23 (1997)

[2] D. Beaudoin, “Various applications to a more realistic baseball simulator,” Journal of Quantitative Analysis in Sports 9, 271–283 (2013)

[3] M. D. Pankin, “Finding Better Batting Orders,” SABR XXI (1991)

[4] D. A. D’Esopo and B. Lefkowitz, “The Distribution of Runs in the Game of Baseball,” SRI Internal Report (1960)

About Author

statshacker is an Assistant Professor of Physics and Astronomy at a well-known state university. His research interests involve the development and application of concepts and techniques from the emerging field of data science to study large data sets. Outside of academic research, he is particularly interested in such data sets that arise in sports and finance. Contact: statshacker@statshacker.com