The game of baseball can be described, with remarkable accuracy, by certain probability models.
The Markov chain model is perhaps the most powerful and elegant of these. This model has been described in detail in a previous article. Important details will be reviewed below.
The model itself, however, does not provide details about the sources of the probabilities. For baseball, these are the events which constitute the game.
In this Article, an event-based framework for the Markov chain model of baseball is presented and discussed in detail.
This Article is outlined as follows. The event-based description of baseball is first discussed. The Markov chain model of baseball is then concisely reviewed. An event-based framework for this model is then described. As an example application, a batting-event model is considered. Some notes on strategies are discussed in this context. Results for the number of runs scored in a half inning of baseball are then presented, following a discussion of the methods used to calculate them. A discussion concludes.
This Article is part of the following series exploring the Markov chain model of baseball, and its utility:
- The Markov Chain Model of Baseball
- An Event-Based Framework for the Markov Chain Model of Baseball
- Statistical Analysis of the Stochastic Markov Matrices
Note also that the theoretical approach discussed here is implemented in the quantitative computational package _statshacker by statshacker.
To cite this Article:
statshacker, “An Event-Based Framework for the Markov Chain Model of Baseball,” statshacker [http://statshacker.com/an-event-based-framework-for-the-markov-chain-model-of-baseball] Accessed: YYYY-MM-DD
statshacker, “Statistical Analysis of the Stochastic Markov Matrices,” statshacker [http://statshacker.com/statistical-analysis-of-the-stochastic-markov-matrices] Accessed: YYYY-MM-DD
The Event-Based Description of Baseball
The discrete, well-defined and relatively “clean” structure of the game of baseball was discussed in a previous article, in terms of baseball states and transitions between them.
This structure can be described in more detail in terms of the causes of these transitions, namely the events which constitute the game.
The reporting of event data for games (e.g., MLB Gameday), and the creation of computer databases of their play-by-play accounts (e.g., Retrosheet), show that the game can be described concisely in terms of these events.
Some of these events are discussed below, in the context of a batting-event model of baseball.
The Markov Chain Model of Baseball
The Markov chain model of baseball has been discussed in detail in a previous article.
This model provides a stochastic description of the game of baseball in terms of states and transitions between them.
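The latter are described by the transition matrix $T$,
$$T = \begin{pmatrix} A_0 & B_0 & C_0 & D_0 \\ 0 & A_1 & B_1 & E_1 \\ 0 & 0 & A_2 & F_2 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \tag{1}$$
where $A$, $B$, $C$, $D$, $E$, and $F$ describe particular transitions, and the subscripts denote the number of outs (see here); these notations will be used herein.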
Depending on the construction of the transition matrix, and on the events that are used to calculate its transition probabilities, different insights can be obtained.
Below, this matrix is decomposed in terms of events that further describe the game of baseball.
An Event-Based Framework
Bukiet, Harold, and Palacios (implicitly) introduced [1] an event-based framework for the Markov chain model of baseball.
This framework is first concisely summarized below. A more general one is then presented, which extends upon the underlying idea.
Bukiet, Harold, and Palacios Framework
Bukiet, Harold, and Palacios fundamentally considered [1] the idea of constructing a transition matrix [Eq. (1)] for each player [who (obviously) have different abilities]. Note that this is not considered herein; but the following consideration provides the necessary motivation.
They further considered that if the exact probabilities of a player changing one state of a game to another are not known, then the transition matrix can be filled in using any model (emphasis added, for below) of how particular events advance runners, combined with statistics about how often such events occur. (Specific consideration in Ref. [1] was on batting events, which is expanded upon below.)
This latter “framework” is generalized herein.
More General Framework
The Bukiet, Harold, and Palacios framework [1] can be generalized.
Consider the decomposition of the matrix in Eq. (1) as
$$T = \sum_{e \in \Omega} p_e T_e, \tag{2}$$
where the label $e$ in the sum runs over the possible events, $\Omega$ denotes the sample space, $p_e$ is the probability of event $e$, and $T_e$ is its transition matrix.
There are several advantages to writing in this form; two are as follows:
One is the separation of the probability of an event from its outcome (which may also be stochastic).
These (completely) describe different aspects of an event. The former is how likely it is to occur. This changes from season to season, and (certainly) team to team or player to player. The latter describes runner advancement. This may not change as significantly from season to season, and (under an assumption of uniformity) not from team to team or player to player.
Another advantage is that the events can be divided into those for which baserunner advancement is analytical and those for which it is stochastic, the latter with elements that can be analyzed statistically (e.g., from play-by-play data). Note that a “third” category of model baserunner advancement may also be useful to approximate stochastic matrix elements (or entire matrices) when there is insufficient data.
This division makes it straightforward to isolate and study aspects of each event individually.
Note that division of baserunner advancement into two such categories of events has recently been considered [2] in the (related, but different) context of simulation; whereas using play-by-play data to calculate transition matrices dates back at least to Pankin [3].
This approach can be contrasted with (the more common one of) using a deterministic model for baserunner advancement; perhaps the most common of these is that by D’Esopo and Lefkowitz [4]. This latter approach will be considered again below. The former one will therefore be termed the data-driven model.
These ideas are expanded upon herein, and applied to the Markov chain model of baseball, as described in a previous article and above.
They are made explicit below for a batting-event model of baseball.
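The decomposition of the transition matrix into a probability-weighted sum of event transition-matrices can be illustrated with a minimal sketch. The function name `combine` and the two-state toy chain (inning in progress, inning over) are assumptions made only for illustration; they are not the 25-state model of this Article.

```python
def combine(event_probs, event_matrices):
    """Return the probability-weighted sum of event matrices, sum_e p_e * T_e."""
    if abs(sum(event_probs.values()) - 1.0) > 1e-9:
        raise ValueError("event probabilities must sum to one")
    n = len(next(iter(event_matrices.values())))
    total = [[0.0] * n for _ in range(n)]
    for event, p in event_probs.items():
        matrix = event_matrices[event]
        for i in range(n):
            for j in range(n):
                total[i][j] += p * matrix[i][j]
    return total


# Toy example (hypothetical): state 0 = inning in progress, state 1 = inning over.
toy_matrices = {
    "safe": [[1.0, 0.0], [0.0, 1.0]],  # play continues
    "out": [[0.0, 1.0], [0.0, 1.0]],   # the inning ends
}
toy_probs = {"safe": 0.7, "out": 0.3}
toy_T = combine(toy_probs, toy_matrices)
```

Because each event matrix is row-stochastic and the event probabilities sum to one, the combined matrix is row-stochastic as well; the probabilities and the advancement rules can then be changed independently of each other.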
Example Application: A (Simplified) Model of Baseball
As an example, a (simplified) model of baseball is considered.
Note that “(simplified)”, in this context, means only that the sample space is incomplete. This is necessary, however, in order to create a model with well-defined states that accurately and completely describe the space. The events included though do account for a significant portion of the effects in the game (see below, and the results).
In future articles, this space will be enlarged, to encompass additional effects.
(Simplified) Batting-Event Model
The batting-event model, as the name implies, is based on consideration of batting events.
This model was considered (though, in a total probability context, and not in terms of events) as an example application in a previous article, and found to provide an accurate number of expected runs in a half inning of baseball.
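The sample space for this model is as follows:
$$\Omega = \{\mathrm{K},\ \mathrm{BB},\ \mathrm{HBP},\ \mathrm{O},\ \mathrm{1B},\ \mathrm{2B},\ \mathrm{3B},\ \mathrm{HR}\}, \tag{3}$$
where $\mathrm{K}$ denotes strikeout, $\mathrm{BB}$ is base on balls, $\mathrm{HBP}$ is hit by pitch, $\mathrm{O}$ is outs other than strikeout (considered in more detail below), $\mathrm{1B}$, $\mathrm{2B}$, and $\mathrm{3B}$ are singles, doubles, and triples, respectively, and $\mathrm{HR}$ is home run.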
Note that consideration of intentional base on balls is discussed in the context of strategies.
The $\mathrm{O}$ event (outs other than strikeout) requires some discussion.
This event is considered an “all other” category, for events including outs; as will be clarified and discussed further in the context of event probabilities. It consists of the following (specific) events:
The latter three (specific) events are properly included as components, as the batter would have been out, had they not occurred.
Consideration of sacrifice plays, also as part of this event, is discussed in the context of strategies.
Events Not Considered
Non-batting events are not considered in the batting-event model.
This excludes the following events related to base stealing:
- stolen base
- defensive indifference
- caught stealing
- pickoff;
or the following:
- other advance / out advancing.
The following pitching-specific events (often related to the latter event above) are also not considered:
- wild pitch
- passed ball.
Note that these events may involve the batter-runner; these are mentioned below, in the context of strikeouts.
Finally, the following events that illegally change the course of play are not considered:
- interference
- balk.
Event Probabilities
Depending on the source of event probabilities, different information can be calculated. For example, using league averages, the expected number of runs in a (general) league game can be calculated; this is done below.
In any case, event probabilities are calculable from readily-available statistics.
Approximating that the event probabilities are situation independent, the following approach can be used:
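Given the finite sample space [Eq. (3)], event probabilities are relative to (modified) plate appearances $\mathrm{PA}'$, calculated from statistics as
$$\mathrm{PA}' = \mathrm{AB} + \mathrm{BB} + \mathrm{HBP} + \mathrm{SH} + \mathrm{SF}, \tag{4}$$
where $\mathrm{AB}$ is at bats, and $\mathrm{SH}$ and $\mathrm{SF}$ are sacrifice hits and flies, respectively. [Modified, because catcher’s interference is not included: $\mathrm{PA}' = \mathrm{PA} - \mathrm{CI}$, where $\mathrm{PA}$ is plate appearances (not modified) and $\mathrm{CI}$ is catcher’s interference.]
For example,
$$p_{\mathrm{1B}} = \frac{\mathrm{H} - \mathrm{2B} - \mathrm{3B} - \mathrm{HR}}{\mathrm{PA}'},$$
where $\mathrm{H}$ is hits.
$p_{\mathrm{O}}$, however, is inferred as
$$p_{\mathrm{O}} = 1 - \left(p_{\mathrm{K}} + p_{\mathrm{BB}} + p_{\mathrm{HBP}} + p_{\mathrm{1B}} + p_{\mathrm{2B}} + p_{\mathrm{3B}} + p_{\mathrm{HR}}\right).$$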
Hence its consideration as an “all other” category. This leads to a slight list–length type of effect. The probability associated with an out is, in this case, (obviously) increased. Though, by proper consideration of events included or not, including in the calculations of probabilities, this effect is minimized.
The validity of this approximation depends on the event. This will be discussed below, in the context of strategies.
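The calculation of situation-independent event probabilities from season totals can be sketched as follows. The function name, the dictionary keys, and the counts are hypothetical (the counts are made up and are not league data); obtaining singles as hits minus extra-base hits is also an assumption of this sketch.

```python
def event_probabilities(stats):
    """Event probabilities relative to (modified) plate appearances."""
    # Modified plate appearances: catcher's interference is not included.
    pa_mod = stats["AB"] + stats["BB"] + stats["HBP"] + stats["SH"] + stats["SF"]
    singles = stats["H"] - stats["2B"] - stats["3B"] - stats["HR"]
    counts = {
        "K": stats["SO"], "BB": stats["BB"], "HBP": stats["HBP"],
        "1B": singles, "2B": stats["2B"], "3B": stats["3B"], "HR": stats["HR"],
    }
    probs = {event: count / pa_mod for event, count in counts.items()}
    # Outs other than strikeouts are inferred as the "all other" remainder.
    probs["O"] = 1.0 - sum(probs.values())
    return probs


# Made-up season totals, for illustration only.
example_stats = {"AB": 5500, "H": 1400, "2B": 280, "3B": 30, "HR": 200,
                 "BB": 500, "HBP": 60, "SO": 1300, "SH": 20, "SF": 40}
example_probs = event_probabilities(example_stats)
```

Since the probability of outs other than strikeouts is taken as the remainder, any event left out of the sample space is implicitly counted in that category.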
Event Transition-Matrices
In the general framework, event transition-matrices can be divided into the following two primary categories:
- analytical matrices
- stochastic matrices;
augmented by the following third one:
- (deterministic) model matrices.
Analytical Matrices
Baserunner advancement is analytical for the following events: $\mathrm{K}$, $\mathrm{BB}$, $\mathrm{HBP}$, and $\mathrm{HR}$.
The precise forms of these matrices are reported below.
Matrix $T_{\mathrm{K}}$
A strikeout (in this consideration — see the following note) increases the number of outs by one, while leaving the baserunner positions unaffected.
Note that this does not consider events where the catcher does not catch the third strike (possibly scored as a wild pitch or passed ball), and the batter-runner reaches first base safely.
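The matrix therefore takes the form
$$T_{\mathrm{K}} = \begin{pmatrix} 0 & B_0 & 0 & 0 \\ 0 & 0 & B_1 & 0 \\ 0 & 0 & 0 & F_2 \\ 0 & 0 & 0 & 1 \end{pmatrix},$$
where
$$B_0 = B_1 = I,$$
where $I$ is the $8 \times 8$ identity matrix, and $F_2$ is
$$F_2 = (1, 1, 1, 1, 1, 1, 1, 1)^{\mathsf{T}}.$$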
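The strikeout matrix can be constructed explicitly once an ordering of the states is fixed. The sketch below assumes 25 states: eight base states for each of zero, one, and two outs, followed by a single absorbing three-out state. The ordering of the base states and the helper names are assumptions of this sketch.

```python
# Assumed ordering of the eight base states (runners on the listed bases).
BASES = ["", "1", "2", "3", "12", "13", "23", "123"]
N_STATES = 3 * len(BASES) + 1
THREE_OUTS = N_STATES - 1


def state_index(outs, bases):
    """Index of the state with `outs` outs and runners on `bases`."""
    return THREE_OUTS if outs >= 3 else len(BASES) * outs + BASES.index(bases)


def strikeout_matrix():
    """One more out, baserunners unchanged (dropped third strikes are ignored)."""
    T = [[0.0] * N_STATES for _ in range(N_STATES)]
    for outs in range(3):
        for bases in BASES:
            T[state_index(outs, bases)][state_index(outs + 1, bases)] = 1.0
    T[THREE_OUTS][THREE_OUTS] = 1.0  # the three-out state is absorbing
    return T


T_K = strikeout_matrix()
```

With this ordering, the two identity blocks sit one block-column to the right of the diagonal, and every two-out state leads to the absorbing state.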
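Matrix $T_{\mathrm{BB}}$
A base on balls puts the batter-runner on first base, baserunners advance only if forced, and the number of outs is not increased.
The matrix therefore takes the form
$$T_{\mathrm{BB}} = \begin{pmatrix} A_0 & 0 & 0 & 0 \\ 0 & A_1 & 0 & 0 \\ 0 & 0 & A_2 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \tag{5}$$
where $A_0 = A_1 = A_2 = A$, and each row of the $8 \times 8$ matrix $A$ has a single element equal to one, in the column of the base state reached after the forced advancement (e.g., runners on first and third lead to bases loaded).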
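The forced-advancement rule for a base on balls can be sketched as follows. The ordering of the base states and the function names are assumptions of this sketch; the number of runs scored is returned separately, since it is not an element of the transition matrix.

```python
# Assumed ordering of the eight base states (runners on the listed bases).
BASES = ["", "1", "2", "3", "12", "13", "23", "123"]


def walk_advance(bases):
    """Base state and runs scored after a walk: only forced runners advance."""
    occupied = set(bases)  # e.g. "13" -> {"1", "3"}
    runs = 0
    if "1" in occupied:
        if "2" in occupied:
            if "3" in occupied:
                runs = 1  # bases loaded: the runner on third is forced home
            occupied.add("3")
        occupied.add("2")
    occupied.add("1")
    return "".join(sorted(occupied)), runs


def walk_block():
    """8x8 block for a walk: a single one in each row."""
    A = [[0.0] * len(BASES) for _ in BASES]
    for i, bases in enumerate(BASES):
        new_bases, _ = walk_advance(bases)
        A[i][BASES.index(new_bases)] = 1.0
    return A


A_BB = walk_block()
```

The same block applies for zero, one, and two outs, because a walk does not change the number of outs.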
Matrix $T_{\mathrm{HBP}}$
A hit by pitch results in the same way as a base on balls.
Note, however, the subtle difference that, in this case, the ball is dead; whereas, when a walk occurs, the ball is live. Since base-stealing events are not considered (and would be handled separately anyway), this does not affect the model.
In any case, since in the decomposition in Eq. (2) the event transition-matrix describes the probabilistic outcome of only that event,
$$T_{\mathrm{HBP}} = T_{\mathrm{BB}}.$$
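Matrix $T_{\mathrm{HR}}$
A home run clears the bases, while not increasing the number of outs.
The matrix therefore takes the form
$$T_{\mathrm{HR}} = \begin{pmatrix} A_0 & 0 & 0 & 0 \\ 0 & A_1 & 0 & 0 \\ 0 & 0 & A_2 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix},$$
where $A_0 = A_1 = A_2 = A$, and the only nonzero elements of the $8 \times 8$ matrix $A$ are ones in the column of the bases-empty state (one in each row).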
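A minimal sketch of the home-run block, together with the runs scored from each base state, is given below. The ordering of the base states and the function names are assumptions of this sketch; how runs are attached to transitions is not specified in this Article, so the separate `home_run_runs` function is only one possible bookkeeping choice.

```python
# Assumed ordering of the eight base states (runners on the listed bases).
BASES = ["", "1", "2", "3", "12", "13", "23", "123"]


def home_run_block():
    """8x8 block for a home run: every base state leads to bases empty."""
    A = [[0.0] * len(BASES) for _ in BASES]
    empty = BASES.index("")
    for row in A:
        row[empty] = 1.0
    return A


def home_run_runs(bases):
    """Runs scored on a home run: every baserunner plus the batter."""
    return len(bases) + 1


A_HR = home_run_block()
```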
Stochastic Matrices
Baserunner advancement is stochastic for the following events: $\mathrm{O}$, $\mathrm{1B}$, $\mathrm{2B}$, and $\mathrm{3B}$.
The precise forms of these matrices (in a different focus of baserunner advancement) will be considered in a future article(s).
(Deterministic) Model Matrices
The stochastic matrices are in principle possible to analyze statistically.
In practice, however, there can often be insufficient data to calculate all transition matrix elements with sample proportions significantly close to population ones. This is dependent on the type of event (e.g., a triple is the rarest type of hit) and the number of events (samples) available.
In these cases, the stochastic matrix elements (or entire matrices) can be approximated by any deterministic model of baserunner advancement.
Perhaps the most common of these, as mentioned above, is that by D’Esopo and Lefkowitz [4]. Note that this model is somewhat conservative for some events and neutral for others (see below, though); overall, it is conservative.
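Triples, for example, in this model [4], put the batter-runner on third base, and all baserunners score. (The number of outs does not change.)
The matrix therefore takes the form
$$T_{\mathrm{3B}} = \begin{pmatrix} A_0 & 0 & 0 & 0 \\ 0 & A_1 & 0 & 0 \\ 0 & 0 & A_2 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \tag{6}$$
where $A_0 = A_1 = A_2 = A$, and the only nonzero elements of the $8 \times 8$ matrix $A$ are ones in the column of the runner-on-third state (one in each row).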
This is quite accurate; though, it does not allow for outs or scoring by the batter-runner.
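The block-diagonal structure of the deterministic triple matrix can be sketched as follows: an 8x8 block is built once and repeated for zero, one, and two outs, with a final absorbing three-out state. The ordering of the base states and the function names are assumptions of this sketch.

```python
# Assumed ordering of the eight base states (runners on the listed bases).
BASES = ["", "1", "2", "3", "12", "13", "23", "123"]


def triple_block():
    """8x8 block for a triple: every base state leads to a lone runner on third."""
    A = [[0.0] * len(BASES) for _ in BASES]
    third_only = BASES.index("3")
    for row in A:
        row[third_only] = 1.0
    return A


def block_diagonal(A):
    """25x25 matrix with A repeated for 0, 1, and 2 outs, plus the absorbing state."""
    n = len(A)
    size = 3 * n + 1
    T = [[0.0] * size for _ in range(size)]
    for outs in range(3):
        for i in range(n):
            for j in range(n):
                T[n * outs + i][n * outs + j] = A[i][j]
    T[size - 1][size - 1] = 1.0
    return T


def triple_runs(bases):
    """Runs scored on a triple in the deterministic model: all baserunners."""
    return len(bases)


T_3B = block_diagonal(triple_block())
```

The helper `block_diagonal` applies equally to any event that leaves the number of outs unchanged, such as a base on balls or a home run.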
This model [4] and stochastic matrices will be considered further in a future article.
How this matrix is considered herein is discussed in the Methods section.
Notes on Strategies
Strategies (both offensive and defensive) are situation dependent.
Situation information (e.g., in event probabilities), however, is not considered in this application.
This poses a challenge for consideration of some types of events or plays.
Consider sacrifice plays, and the following two options for their consideration:
- By including such plays, the number of (modified) plate appearances is increased [Eq. (4)]. This increases the probability of an $\mathrm{O}$ event (an out other than a strikeout), and decreases those of all other events. These probabilities, though, are assumed to be independent of the baseball state. Since these plays occur relatively infrequently, the probability of an $\mathrm{O}$ event is mostly overestimated.
- By not including these plays, however, the transition matrix elements do not realistically reflect the progression of a game of baseball.
Related arguments can be made for intentional base on balls. Such events occur most frequently with an excellent hitter at the plate and a significantly worse one on deck. Including such events realistically describes this occurrence on average. Not described, however, is that these situations generally occur with no one on first base.
Note that related arguments could be made for some other types of events also.
Strategies will be considered in more detail in a future article.
Methods
Results were calculated for the American League regular-season games (including interleague play) for the 2010–2017 seasons.
Batting-Event Model
Retrosheet play-by-play data was used to calculate transition matrices for the (simplified) batting-event model.
These were calculated using single-season data (as done in a previous article).
An Event-Based Framework
Event Probabilities
Baseball-Reference.com was used to calculate event probabilities for the league.
Only statistics for the American League teams were included in these calculations (even in the consideration of interleague games), because it is these that are reported (e.g., see here, for the 2017 season). Results are therefore correspondingly compared (below) for runs scored only by these teams.
Transition Matrices
Retrosheet play-by-play data was also used to calculate the stochastic transition matrices.
Baserunner advancement was assumed to be league independent. This allows for more than twice as much data to be used in the calculations.
Stochastic matrices were calculated using an average of single-season ones, over five seasons; and truncated at the upper part of this range, for more recent seasons.
The deterministic matrix [Eq. (6)] was used (entirely).
Simulations
The Markov chain model was simulated one billion times.
A future article will discuss simulations versus calculations.
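A simulation of half innings can be sketched as follows. This is not the calculation used for the results below: the event probabilities are hypothetical (league-like, but made up), far fewer half innings are simulated, and deterministic baserunner advancement is assumed for every event (single: runner on first to second, others score; double: runner on first to third, others score; triple: all score; outs: no advancement), whereas this Article uses data-driven matrices for most stochastic events.

```python
import random

# Hypothetical, league-like event probabilities (illustrative only; they sum to 1).
EVENT_PROBS = {"K": 0.21, "BB": 0.08, "HBP": 0.01, "O": 0.47,
               "1B": 0.15, "2B": 0.045, "3B": 0.005, "HR": 0.03}
EVENTS = list(EVENT_PROBS)
WEIGHTS = list(EVENT_PROBS.values())


def advance(event, bases):
    """Return (new_bases, runs, outs_added); bases = (first, second, third)."""
    first, second, third = bases
    on_base = int(first) + int(second) + int(third)
    if event in ("K", "O"):  # assumed: no advancement on outs
        return bases, 0, 1
    if event in ("BB", "HBP"):  # forced advancement only
        runs = int(first and second and third)
        return (True, first or second, third or (first and second)), runs, 0
    if event == "1B":
        return (True, first, False), int(second) + int(third), 0
    if event == "2B":
        return (False, True, first), int(second) + int(third), 0
    if event == "3B":
        return (False, False, True), on_base, 0
    if event == "HR":
        return (False, False, False), on_base + 1, 0
    raise ValueError(f"unknown event: {event}")


def simulate_half_inning(rng):
    """Simulate one half inning and return the number of runs scored."""
    outs, bases, runs = 0, (False, False, False), 0
    while outs < 3:
        event = rng.choices(EVENTS, weights=WEIGHTS)[0]
        bases, scored, outs_added = advance(event, bases)
        runs += scored
        outs += outs_added
    return runs


def mean_runs(n_half_innings, seed=0):
    """Average runs per half inning over independent simulated half innings."""
    rng = random.Random(seed)
    total = sum(simulate_half_inning(rng) for _ in range(n_half_innings))
    return total / n_half_innings
```

For example, `mean_runs(20000)` gives an estimate whose statistical error decreases as the inverse square root of the number of simulated half innings, which is why a very large number of simulations is needed for a stable third decimal place.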
Results
The Markov (expected) number of runs in a half inning is compared to the “actual” results in the following table:
Year | Actual | Markov | Difference (%) |
---|---|---|---|
2017 | 0.529 | 0.517 (1.045) | 2.4 |
2016 | 0.508 | 0.503 (1.029) | 0.9 |
2015 | 0.491 | 0.483 (1.004) | 1.8 |
2014 | 0.466 | 0.453 (0.973) | 2.7 |
2013 | 0.483 | 0.481 (1.006) | 0.3 |
2012 | 0.4975 | 0.490 (1.014) | 1.5 |
2011 | 0.4987 | 0.489 (1.015) | 1.8 |
2010 | 0.4992 | 0.498 (1.030) | 0.1 |
The “actual” results were approximated as the average number of runs scored per game divided by the average number of innings pitched (both only by the American League teams) per game; these results were obtained from Baseball-Reference.com.
The Markov results are shown to three decimal places, as this precision is stable from simulation to simulation. The errors are shown to the same number of decimal places, only for clarity; the precise error in all cases is run.
The Markov number of runs agrees very well with the actual results, for all seasons. They are slightly underestimated in all cases though (but, as expected); on average, only by %.
In addition, the two sets of results are strongly correlated; Pearson’s correlation coefficient .
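The summary statistics quoted above can be recomputed from the tabulated values with a minimal sketch. The lists below are copied from the table; because the table entries are rounded, values recomputed this way are only approximations to those obtained from the unrounded results. The function name `pearson` is an assumption of this sketch.

```python
from math import sqrt

# Values copied from the table above (2017 down to 2010).
actual = [0.529, 0.508, 0.491, 0.466, 0.483, 0.4975, 0.4987, 0.4992]
markov = [0.517, 0.503, 0.483, 0.453, 0.481, 0.490, 0.489, 0.498]
difference_pct = [2.4, 0.9, 1.8, 2.7, 0.3, 1.5, 1.8, 0.1]


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


mean_difference_pct = sum(difference_pct) / len(difference_pct)
r = pearson(actual, markov)
```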
Difficult results to get correct are the subtle differences between seasons 2010–2012. The actual results are nearly identical; and, so are each shown in the above table to one additional decimal place. The Markov results, however, show two problems: (i) they do not predict the relative difference between these seasons well (compare the difference from 2010 to either of the other two seasons); moreover, (ii) the wrong ordering is predicted for seasons 2011 and 2012 [more runs are (incorrectly) predicted for the latter]. Improvements to these problems will be discussed in a future article.
Comparing these results to the (basic) Markov chain model of baseball, they are seen to be % closer (on average). It is tempting to conclude that these results are therefore (much) better. Consideration must be made in the meaning of “better”, however. Any plausible explanations attributing this to (only) the closer number of runs ultimately seem to be unjustified.
The (basic) model is “complete”, in that all batting events are precisely accounted for (through transition probabilities). In addition, since these are calculated for only the season of interest, the probabilities are modeled precisely (for that season). Indeed, these are reflected by the significantly-high correlation coefficient.
Therefore, the basic model represents a best-case scenario. It achieves this, however, at the expense of overfitting to the data; this is clear from the fact that any transition matrix calculated for a season applies only to that season.
With this consideration, a plausible explanation for the much lower correlation (similarly, higher standard deviation), in this case, can be made in terms of the stochastic nature of the game of baseball.
Consider that, in this model, the events which constitute the game of baseball were complete, and precisely resolved. (The resolution of events in the transition matrices will be discussed in a future article.) Due to the stochasticity, however, a finite number of events over a single season may not follow this resolution precisely. This would therefore lead to a lower correlation between the model and actual data.
This argument is based on the assumption of a complete, and precisely resolved model.
For any (actual) model, there will also be an attribution to the lower correlation from the lack of either one of these properties. Considering a “baseline” model (e.g., that in this Article), though, it is expected that improvements will increase this aspect of the correlation with the actual data.
Discussion
An event-based framework for the Markov chain model of baseball was presented and discussed in detail. The fundamental idea is the decomposition of the (total) transition matrix [Eq. (1)] as a sum over individual events [Eq. (2)] of products of event probabilities and event transition-matrices.
Some advantages of this approach were already discussed above.
Of particular focus was that event transition-matrices are categorized into analytical or stochastic. The former were presented; and the latter were discussed to be calculable, and augmented (if need be) by a model for baserunner advancement.
A statistical analysis of the stochastic matrices will be considered in a future article.
There is much insight that could be obtained from the presented framework and application.
There is also much room for improvement. For the basic framework, this includes state-based event probabilities, as discussed in the context of strategies. Going beyond the batting-event model will require including additional effects (see above).
All of this (and more) will be considered in future articles.
References
[1] B. Bukiet, E. R. Harold, and J. L. Palacios, “A Markov Chain Approach to Baseball,” Operations Research 45, 14–23 (1997)
[2] D. Beaudoin, “Various applications to a more realistic baseball simulator,” Journal of Quantitative Analysis in Sports 9, 271–283 (2013)
[3] M. D. Pankin, “Finding Better Batting Orders,” SABR XXI (1991)
[4] D. A. D’Esopo and B. Lefkowitz, “The Distribution of Runs in the Game of Baseball,” SRI Internal Report (1960)