An event-based framework for the Markov chain model of baseball has been detailed in a previous article.

This is based on a decomposition of the transition matrix of the Markov chain in terms of event (transition-) matrices that describe baseball.

In this work, a statistical analysis of the stochastic event-transition matrices is made.

The central consideration is a balance between accuracy and (possible) changes in baserunner advancement from season to season.

This analysis provides a lower bound to the probabilities that transitions from any baseball state to all others are simultaneously within specified distances.

It also highlights additional considerations that must be made for event-based matrices, compared to the (total) transition matrix of the (standard) Markov model.

These results provide a foundation for, and insight into, the event-based framework for the Markov chain model of baseball.

To cite this Article:

statshacker, “An Event-Based Framework for the Markov Chain Model of Baseball,” *statshacker* [http://statshacker.com/an-event-based-framework-for-the-markov-chain-model-of-baseball] Accessed: YYYY-MM-DD

statshacker, “Statistical Analysis of the Stochastic Markov Matrices,” *statshacker* [http://statshacker.com/statistical-analysis-of-the-stochastic-markov-matrices] Accessed: YYYY-MM-DD

## Introduction

The game of baseball can be described, with remarkable accuracy, by certain probability models.

The Markov chain model of baseball is perhaps the most powerful and elegant of these.

An event-based framework for this model has been described in detail in a previous article.

This is based on a decomposition of the transition matrix of the Markov chain in terms of event (transition-) matrices that describe baseball.

In this Communication, a **statistical analysis of the stochastic Markov (event transition-) matrices** for this framework is performed. In particular, the amount of data necessary to calculate the matrix elements is determined, under consideration of (possible) changes in baserunner advancement from season to season.

This Communication is outlined as follows. The methods are first discussed. Results are then reported. A discussion follows. Finally, conclusions are made.

This Communication is part of the following series exploring the Markov chain model of baseball, and its utility:

- The Markov Chain Model of Baseball
- An Event-Based Framework for the Markov Chain Model of Baseball
- **Statistical Analysis of the Stochastic Markov Matrices**

Note that the theoretical approach discussed here is implemented in the quantitative computational package _statshacker by **statshacker**.

## Methods

This section describes application of the **data mining process** (detailed here) to this problem.

### Data Understanding

Play-by-play data was obtained from Retrosheet.

### Data Preparation

Data was processed and stored in a relational database using the relational database management system PostgreSQL.

The following data preparation used DB++ as an interface to PostgreSQL, and bbDBi as an interface to the baseball database.

Event data was calculated as detailed in a previous article.

Note that no filtering by league or game type (regular vs. postseason), or any other filtering, was applied.

Event data was calculated over up to nine seasons. (An odd number of seasons is needed, as explained below; and considering more than a decade of data seems excessive.) Calculations were centered (see below) at 2013, unless otherwise specified. [This allows for use of the maximum amount of data, up to the last complete season (2017), as of this writing.]

### Modeling

Stochastic **event transition-matrices** have been discussed in a previous article.

In the following, denote this matrix for event at season by .

This matrix, at season , is calculated using an -season average,

(1)

centered (see also below) at . (Hence the necessity of an odd number of seasons.)
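In symbols, the centered average described here can be written as follows. The notation is an assumption made for illustration only: $\mathbf{T}_e^{(n)}(s)$ stands for the $n$-season matrix for event $e$ at season $s$, and $\mathbf{T}_e^{(1)}(s')$ for a single-season matrix.

```latex
% Assumed notation (hypothetical, for illustration):
%   T_e^{(n)}(s)  : n-season average matrix for event e, centered at season s
%   T_e^{(1)}(s') : single-season matrix for event e at season s'
\begin{equation}
  \mathbf{T}_e^{(n)}(s)
    = \frac{1}{n} \sum_{s' = s - (n-1)/2}^{s + (n-1)/2} \mathbf{T}_e^{(1)}(s'),
  \qquad n \ \text{odd}.
\end{equation}
```

With $n$ odd, the range contains $(n-1)/2$ seasons on each side of $s$, which is why an odd number of seasons is required.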

Realize that Eq. (1) averages single-season matrices, as opposed to totaling all events over the same time period. This gives weight (equal, in this case; see below) to seasons, rather than to events. This should provide a better “balance”, since the number of events almost certainly differs from season to season, as may baserunner advancement.

Note also that Eq. (1) implicitly weights equally the season of current interest and those surrounding it. The possibility of a weighted average will be considered in a future article.

Equation (1) can be evaluated directly when the upper end of the centered range does not extend past the current season.

When the range does extend past the current season, however, further consideration has to be made.

Two straightforward approaches to such situations use “one-sided” averages, as follows:

**First**: **One-sided “partial”** results may be calculated as

(2)

The lower part of the range is kept fixed; and the (total) range is truncated by the upper part.

**Second**: **One-sided “full”** results may be calculated as

(3)

The lower part of the range is extended to compensate for the reduced range that would occur with the “partial” approach (see above).

The first approach treats changes in baserunner advancement as more important than the amount of data used to calculate the matrix; the second does the reverse. Note that the amount of data is related to the accuracy to which these matrices can be calculated. Consideration of the balance between these two is of central importance herein.

Note that centered data could also continue to be used. This, however, would require truncating the lower part of the range to match the truncation at the upper part, which would weight changes in baserunner advancement even more heavily relative to the amount of data. In order to study this balance, though, considering only the above two possibilities is sufficient.

These ideas form the basis for further discussion.
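The three ranges described above (centered, one-sided “partial”, and one-sided “full”) can be sketched as follows. This is a minimal illustration, not the implementation used for the results; the function names `season_window` and `average_matrices`, and the representation of a matrix as nested lists, are assumptions made here.

```python
from typing import Dict, List

Matrix = List[List[float]]


def season_window(season: int, n: int, current: int, mode: str = "centered") -> List[int]:
    """Seasons entering an n-season average for `season`.

    mode "centered": (n - 1) / 2 seasons on each side of `season`.
    mode "partial":  lower end kept fixed, upper end truncated at `current`.
    mode "full":     lower end extended so that n seasons are still used.
    """
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd number of seasons")
    if mode not in ("centered", "partial", "full"):
        raise ValueError("unknown mode")
    if season > current:
        raise ValueError("season lies after the current season")
    half = (n - 1) // 2
    lo, hi = season - half, season + half
    if hi > current:
        if mode == "centered":
            raise ValueError("centered range extends past the current season")
        hi = current
        if mode == "full":
            lo = current - n + 1
    return list(range(lo, hi + 1))


def average_matrices(by_season: Dict[int, Matrix], seasons: List[int]) -> Matrix:
    """Equal-weight average of single-season matrices (seasons, not events, are weighted)."""
    mats = [by_season[s] for s in seasons]
    rows, cols = len(mats[0]), len(mats[0][0])
    return [
        [sum(m[i][j] for m in mats) / len(mats) for j in range(cols)]
        for i in range(rows)
    ]
```

For a three-season average at 2013 with 2013 as the most recent season, `season_window(2013, 3, 2013, "partial")` uses 2012 and 2013, while the `"full"` mode uses 2011 through 2013.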

### Evaluation

The transition from any baseball state (described here) (a row of a stochastic event transition-matrix) to all others (the columns) can be considered as a specification of multinomial proportions. The problem then becomes one of **sample-size determination**. Following Ref. [1], this **objective** can be stated as: select the smallest sample size for a random sample from a multinomial population such that the **probability** will be *at least* 1 − α, where α is the **significance level**, that all of the estimated proportions will simultaneously be within specified **distances** of the true population proportions.
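The worst-case bound published by Thompson (1987) can be sketched as below. This is a sketch under stated assumptions: it uses the normal approximation and the worst-case parameter vector (m equal proportions of 1/m), and it is not necessarily the exact calculation behind the tables that follow (which may, for example, use observed proportions). The names `thompson_d2n` and `alpha_bound`, and the default of 25 categories, are hypothetical.

```python
from statistics import NormalDist
from typing import Tuple

_STD_NORMAL = NormalDist()


def thompson_d2n(alpha: float, max_categories: int = 25) -> Tuple[float, int]:
    """Worst-case value of d^2 * n for significance level alpha, and the m attaining it.

    For m equal proportions 1/m, d^2 * n = z^2 * (1/m) * (1 - 1/m), where z is
    the upper alpha / (2m) point of the standard normal distribution.
    """
    best, best_m = 0.0, 2
    for m in range(2, max_categories + 1):
        z = _STD_NORMAL.inv_cdf(1.0 - alpha / (2.0 * m))
        value = z * z * (1.0 / m) * (1.0 - 1.0 / m)
        if value > best:
            best, best_m = value, m
    return best, best_m


def alpha_bound(n: float, d: float, max_categories: int = 25) -> float:
    """Worst-case significance level for sample size n and distance d (capped at 1)."""
    worst = 0.0
    for m in range(2, max_categories + 1):
        sd = ((1.0 / m) * (1.0 - 1.0 / m)) ** 0.5
        tail = 1.0 - _STD_NORMAL.cdf(d * n ** 0.5 / sd)
        worst = max(worst, min(1.0, 2.0 * m * tail))
    return worst
```

The required sample size for a given distance follows by dividing the first return value of `thompson_d2n` by the squared distance and rounding up; `alpha_bound` inverts the relation, returning a significance level for a given number of events.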

**Evaluation** is carried out by calculating values of α (average) [over all (stochastic) states, not including the three-out states, which are known exactly] and α (maximum) for each matrix, for three values of the distance (0.15, 0.1, and 0.05); the latter two are used for evaluation (see below), while the first is provided for reference. For the selected number of seasons, results for distances of 0.02 and 0.01 are also reported.

Note that the following results are calculated using the total number of events summed over all matrices in Eq. (1). Since this total is expected to be similar (though, almost certainly different) among seasons [not considering things such as the 1994–95 Major League Baseball (MLB) strike], this approximation should work well.

The following criterion was used to select the number of seasons to use:

**Select the minimum number of seasons such that to within , and to within .**

Selecting the minimum number of seasons accounts for changes in baserunning from season to season. The following conditions on as a function of ensure that the probability that every transition is within % is %; and, on average, within % is %.

While the distances may seem large, it turns out that, despite the amount of data, there is a practical limitation. It will be argued below that this criterion provides a reasonable balance between (possible) changes in baserunner advancement and accuracy.

## Results

Stochastic event transition-matrices were evaluated statistically.

Before any analysis, it is clear that significance levels will decrease (i.e., the probability will be higher; see above) with increasing frequency of the event.

The (relative) number of seasons expected to be necessary to achieve similar accuracy for each event can therefore be *estimated* from the probabilities of events. Note that this is only an approximation, since the probabilities of events may depend on the baseball state, etc.

For 2013, for example, probabilities are , , , and for , , , and , respectively. It is therefore expected that to achieve similar accuracy for each event, , , , and the amount of data are necessary.

Such estimations will be supported below, with additional remarks for .
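The kind of estimate described above can be sketched as follows, assuming that the amount of data needed scales inversely with the probability of the event. Both that scaling and the function name `relative_seasons` are assumptions made here, and the probabilities in the example are round placeholders, not the 2013 values.

```python
from typing import Dict


def relative_seasons(event_probabilities: Dict[str, float], reference: str = "out") -> Dict[str, float]:
    """Amount of data needed for each event, relative to the reference event.

    Assumes the required amount of data is inversely proportional to the
    probability of the event (an approximation; see the caveats in the text).
    """
    ref = event_probabilities[reference]
    return {event: ref / p for event, p in event_probabilities.items()}


# Hypothetical placeholder probabilities, chosen only to make the ratios easy to read.
placeholder = {"out": 0.5, "1B": 0.25, "2B": 0.05, "3B": 0.005}
ratios = relative_seasons(placeholder)
```

With these placeholders, an event that is ten times rarer than the reference is estimated to need ten times the data.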

Note that, in the following tables, the number of seasons is set in italics (*n*) to mark (centered) results selected according to the above criterion, and is prefixed with an asterisk (\*n) to mark the one-sided “partial” results corresponding to this selection.

### Out Matrix

Significance levels for the **out matrix** are reported in the following tables:

**Distance = 0.15**:

No. Seasons | Avg. alpha | Max. alpha |
---|---|---|
1 | 1.46e-06 | 3.33e-05 |
3 | 0 | 0 |
5 | 0 | 0 |
7 | 0 | 0 |
9 | 0 | 0 |

**Distance = 0.1**:

No. Seasons | Avg. alpha | Max. alpha |
---|---|---|
*1* | 4.49e-04 | 8.20e-03 |
3 | 6.94e-09 | 1.54e-07 |
5 | 0 | 0 |
7 | 0 | 0 |
9 | 0 | 0 |

**Distance = 0.05**:

No. Seasons | Avg. alpha | Max. alpha |
---|---|---|
*1* | 0.0390 | 0.390 |
3 | 1.07e-03 | 0.0144 |
5 | 7.01e-05 | 1.22e-03 |
7 | 4.63e-06 | 9.28e-05 |
9 | 3.54e-07 | 7.81e-06 |

The out matrix (all elements) is calculable to within the above criterion with only one season.

### 1B Matrix

Significance levels for the **1B matrix** are reported in the following tables:

**Distance = 0.15**:

No. Seasons | Avg. alpha | Max. alpha |
---|---|---|
1 | 8.06e-06 | 0.0125 |
2 | 5.76e-06 | 8.99e-05 |
3 | 8.73e-08 | 1.90e-06 |
5 | 0 | 0 |
7 | 0 | 0 |
9 | 0 | 0 |

**Distance = 0.1**:

No. Seasons | Avg. alpha | Max. alpha |
---|---|---|
1 | 0.0148 | 0.160 |
*2 | 1.16e-03 | 0.0130 |
*3* | 1.36e-04 | 2.17e-03 |
5 | 3.46e-06 | 5.93e-05 |
7 | 1.09e-07 | 2.14e-06 |
9 | 2.59e-09 | 5.72e-08 |

**Distance = 0.05**:

No. Seasons | Avg. alpha | Max. alpha |
---|---|---|
1 | 0.225 | 1 |
*2 | 0.0707 | 0.465 |
*3* | 0.0292 | 0.249 |
5 | 7.94e-03 | 0.0803 |
7 | 2.50e-03 | 0.0294 |
9 | 8.02e-04 | 0.0110 |

Also shown are two-season results (2012 and 2013), for consideration below.

The 1B matrix is calculable with three seasons.

Note that this is consistent with the above estimate.

### 2B Matrix

Significance levels for the **2B matrix** are reported in the following tables:

**Distance = 0.15**:

No. Seasons | Avg. alpha | Max. alpha |
---|---|---|
1 | 0.0585 | 0.567 |
3 | 3.14e-03 | 0.0461 |
5 | 2.74e-04 | 5.14e-03 |
7 | 2.47e-05 | 5.01e-04 |
9 | 2.20e-06 | 4.86e-05 |

**Distance = 0.1**:

No. Seasons | Avg. alpha | Max. alpha |
---|---|---|
1 | 0.208 | 1 |
3 | 0.0307 | 0.319 |
*5 | 7.39e-03 | 0.0990 |
7 | 2.02e-03 | 0.0293 |
*9* | 5.84e-04 | 9.78e-03 |

**Distance = 0.05**:

No. Seasons | Avg. alpha | Max. alpha |
---|---|---|
1 | 0.630 | 1 |
3 | 0.270 | 1 |
*5 | 0.152 | 1 |
7 | 0.0823 | 0.635 |
*9* | 0.0471 | 0.417 |

The 2B matrix is seen to require nine seasons.

Note that this is also consistent with the above estimate, considering that the estimate was made using probabilities of events from only 2013 (and these change from season to season), and the other approximation(s) considered above.
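The selection can be checked mechanically against the tabulated 2B values. In this sketch, the thresholds (0.01 on the maximum at a distance of 0.1, and 0.05 on the average at a distance of 0.05) are assumptions, chosen only because they are consistent with the rows marked in the tables; the function name `select_seasons` is hypothetical.

```python
from typing import Dict, Optional, Tuple


def select_seasons(
    levels: Dict[int, Tuple[float, float]],
    max_limit: float = 0.01,  # assumed threshold on Max. alpha at distance 0.1
    avg_limit: float = 0.05,  # assumed threshold on Avg. alpha at distance 0.05
) -> Optional[int]:
    """Smallest number of seasons meeting both thresholds, or None if none does.

    `levels` maps number of seasons -> (Max. alpha at 0.1, Avg. alpha at 0.05).
    """
    for n in sorted(levels):
        max_alpha, avg_alpha = levels[n]
        if max_alpha <= max_limit and avg_alpha <= avg_limit:
            return n
    return None


# Values copied from the 2B tables (centered results only).
levels_2b = {
    1: (1.0, 0.630),
    3: (0.319, 0.270),
    5: (0.0990, 0.152),
    7: (0.0293, 0.0823),
    9: (9.78e-3, 0.0471),
}
selected_2b = select_seasons(levels_2b)
```

Under these assumed thresholds the function returns nine seasons for 2B, matching the row marked in the tables.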

### 3B Matrix

Significance levels for the **3B matrix** are reported in the following tables:

**Distance = 0.15**:

No. Seasons | Avg. alpha | Max. alpha | No. Replaced |
---|---|---|---|
1 | 0.646 | 1 | 11 |
3 | 0.272 | 1 | 4 |
5 | 0.196 | 1 | 4 |
7 | 0.0926 | 0.613 | 0 |
9 | 0.060 | 0.450 | 0 |
17 | 8.52e-03 | 0.0874 | 0 |
19 | 5.13e-03 | 0.0411 | 0 |
27 | 1.45e-03 | 0.0125 | 0 |

**Distance = 0.1**:

No. Seasons | Avg. alpha | Max. alpha | No. Replaced |
---|---|---|---|
1 | 0.814 | 1 | 16 |
3 | 0.532 | 1 | 6 |
5 | 0.340 | 1 | 4 |
7 | 0.255 | 1 | 4 |
*9 | 0.210 | 1 | 3 |
17 | 0.0611 | 0.453 | 0 |
*19* | 0.0461 | 0.301 | 0 |
27 | 0.0218 | 0.160 | 0 |

**Distance = 0.05**:

No. Seasons | Avg. alpha | Max. alpha | No. Replaced |
---|---|---|---|
1 | 0.960 | 1 | 22 |
3 | 0.845 | 1 | 18 |
5 | 0.745 | 1 | 15 |
7 | 0.685 | 1 | 14 |
9 | 0.617 | 1 | 9 |
17 | 0.374 | 1 | 4 |
19 | 0.335 | 1 | 4 |
27 | 0.248 | 1 | 3 |

17-, 19-, and 27-season results are also shown, calculated for 2009, 2008, and 2004, respectively.

Note that the 27-season results are slightly affected by the 1994–95 MLB strike.

Also shown is the number of rows that are replaced by a model for baserunner advancement (e.g., Ref. [2]), selected here according to a significance-level threshold at a specified distance (a very high significance level, but reasonable; see the following discussion). Note that the importance of such replacements has been discussed in a previous article.

As suspected, the probabilities to which this matrix may be calculated are much lower than for the other events. Even at the largest number of seasons considered, the above criterion is not satisfied. And, based on the above estimation, the number of seasons required would be impractical (e.g., it would not account for changes in baserunner advancement).

An examination of the matrix elements, however, shows that they resemble the ones for model baserunning [] quite closely; for example, for 2008 and calculated for seasons worth of data, the s in the matrix [see Eq. (6) in a previous article]are lower on average by only % (not including the three-out states, and no replaced values — , in the above table). Such slight deviations are seen to occur, for example, when the batter-runner tries to stretch the triple, by running home.

Based on this (prior knowledge), it is reasonable to assume that the matrix elements are indeed within (at least, on average), even though the actual calculation [] (reported in the above tables) suggests only high probabilities (and not within the above criterion).

Therefore, the following more relaxed criterion is used for this event:

**Select the minimum number of seasons such that to within .**

The 3B matrix, with this criterion, requires “only” 19 seasons' worth of data; for centered data, this is therefore nine seasons on each side of that of interest. While significant, this calculation becomes practical.

Disregard prior knowledge (see above), for the sake of discussion. With the above criterion, the maximum alpha is 0.301 (see the above table). Therefore, the probability that *all* matrix elements are within 10% is at least approximately 70%.

And, according to the above replacement criterion, the stochastic matrix can be used directly (i.e., without replacement) for calculations.

### Combined Results

Significance levels for centered results selected according to the above criterion are *repeated* (for convenient reference) in the following tables:

**Distance = 0.1**:

Event | Avg. alpha | Max. alpha |
---|---|---|
out | 4.72e-04 | 8.56e-03 |
1B | 1.36e-04 | 2.17e-03 |
2B | 5.84e-04 | 9.78e-03 |
3B | 0.0461 | 0.301 |

**Distance = 0.05**:

Event | Avg. alpha | Max. alpha |
---|---|---|
out | 0.040 | 0.396 |
1B | 0.0292 | 0.249 |
2B | 0.0471 | 0.417 |
3B | 0.335 | 1 |

**Distance = 0.02**:

Event | Avg. alpha | Max. alpha |
---|---|---|
out | 0.435 | 1 |
1B | 0.457 | 1 |
2B | 0.505 | 1 |
3B | 0.840 | 1 |

**Distance = 0.01**:

Event | Avg. alpha | Max. alpha |
---|---|---|
out | 0.787 | 1 |
1B | 0.797 | 1 |
2B | 0.815 | 1 |
3B | 0.973 | 1 |

### One-Sided “Partial” Results

Significance levels for **one-sided “partial” results** [see the discussion in relation to Eq. (1)] are repeated in the following tables:

**Distance = 0.1**:

Event | Avg. alpha | Max. alpha |
---|---|---|
out | 4.72e-04 | 8.56e-03 |
1B | 1.16e-03 | 0.0130 |
2B | 7.39e-03 | 0.0990 |
3B | 0.210 | 1 |

**Distance = 0.05**:

Event | Avg. alpha | Max. alpha |
---|---|---|
out | 0.040 | 0.396 |
1B | 0.0707 | 0.465 |
2B | 0.152 | 1 |
3B | 0.617 | 1 |

### Markov (Expected) Number of Runs

The Markov (expected) number of runs in a half inning was calculated and is compared to the “actual” results in the following table:

Season | Actual | Markov | Difference (%) |
---|---|---|---|
2017 | 0.529 | 0.515 (1.044) | 2.7 |
2016 | 0.508 | 0.502 (1.027) | 1.2 |
2015 | 0.491 | 0.484 (1.006) | 1.5 |
2014 | 0.466 | 0.454 (0.973) | 2.5 |
2013 | 0.483 | 0.478 (1.002) | 0.9 |
2012 | 0.498 | 0.490 (1.015) | 1.4 |
2011 | 0.499 | 0.493 (1.020) | 1.1 |
2010 | 0.499 | 0.498 (1.029) | 0.2 |

These results were calculated as discussed in a previous article.

Comparing these sets of results reveals insight into the more “converged” data, presented herein.

The Markov number of runs agrees very well with the actual results, for all seasons. The actual results are slightly underestimated in all cases (as expected), though on average by only %.

Comparing to previous results, the differences are the same, on average. The lower standard deviation (% compared to %) though shows that the results herein are more “stable”.

There is also a stronger correlation between the Markov and actual results; Pearson’s correlation coefficient (compared to ). This also resolves the subtle difference between 2011 and 2012 results, where there is actually a slightly higher () number of runs for the former; predicted correctly in the above table.
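Summary statistics of this kind can be recomputed from the table as sketched below. The inputs are the rounded values printed in the table, so the recomputed figures may differ slightly from those obtained from unrounded data; the function name `pearson` is a hypothetical helper.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence

# Values copied from the table above (2017 down to 2010).
actual = [0.529, 0.508, 0.491, 0.466, 0.483, 0.498, 0.499, 0.499]
markov = [0.515, 0.502, 0.484, 0.454, 0.478, 0.490, 0.493, 0.498]
listed_diff = [2.7, 1.2, 1.5, 2.5, 0.9, 1.4, 1.1, 0.2]  # Difference (%) column


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


mean_diff = mean(listed_diff)  # average of the listed differences (%)
sd_diff = stdev(listed_diff)   # sample standard deviation of the differences (%)
r = pearson(markov, actual)    # correlation between Markov and actual runs
```

The sample standard deviation is used here; whether the quoted standard deviations are sample or population values is not stated, so that choice is also an assumption.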

## Discussion

A statistical analysis of the stochastic transition-matrices for the event-based framework for the Markov chain model of baseball was made.

In order to determine these matrix elements to within a reasonable specified distance with a high probability (specified above), data over 1, 3, 9, and 19 seasons was needed for the events out, 1B, 2B, and 3B, respectively.

This is a significant amount of data, especially for the less-frequent events (2B and 3B).

Given the amount of data necessary to achieve significant results, caution should be exercised in the use of data obtained over much shorter time periods (e.g., Ref. [3]).

While the amount of data is not a problem itself, the accuracy achieved by doing this must be balanced against (possible) changes in runner advancement.

Careful thought, however, suggests a (fortunate) coincidence regarding this balance.

In particular, there must be less *possible* change in runner advancement with decreasing frequency of event.

Consider the least-frequent event, 3B. Arguments analogous to those made in relation to this matrix (above) apply in this case. In this context, there are only slight changes in the outcomes (transition probabilities) that one could envision for this event (e.g., perhaps batter-runners become faster, and, as a result, a fraction of a percent more try to stretch their run home).

Similar arguments could be made for the other events.

Coincidentally then, while less-frequent events require more data (and raise the balance issue above), they also have less scope to change over time.

Therefore, uncertainties introduced by using less data should be more important than changes in baserunner advancement; indeed, these are quantifiably large — see above.

In passing, note that this highlights an important advantage of this framework: the separation of the probability of an event (out, 1B, 2B, and 3B, under discussion) occurring from its (stochastic, in this case) outcome (baserunner advancement). While the former may change (significantly) from season to season, the latter may not.

Consider the implication of these results for the (standard) Markov chain model of baseball.

The above results suggest that calculating the transition matrix elements for this model (e.g., over only one season) can’t resolve the underlying effects that make them up.

For this model, more error would be introduced by the less-frequent events. Furthermore, it is these which have the most impact on an event-by-event basis (e.g., the result of a triple, compared to a single).

However, there is also a rapid decrease in the probabilities of these events (see above).

Therefore, the error on the total transition matrix [Eq. (2) in a previous article] would probably be minimized; meaning that this issue is probably not a significant concern.

Resolving these effects, however, is extremely important for the event-based framework. This is because there is a relatively small spread in talent in the MLB; and so accurately determining the impact of such small differences is important.

Return to the above discussion.

Even if there are changes in baserunner advancement, the centering in Eq. (1) should average these away (to some extent). Only if the change were parabolic, such that the season of interest was at a maximum or minimum, would this not be the case.

An important consideration related to this though is whether to use one-sided “partial” or “full” results for more recent seasons.

Note that it is these that are of interest for current and future (predictive) analysis.

This remains fundamentally based on the same important balance as above.

The one-sided “partial” results show a significant decrease in probability for less-frequent events. Matrix elements for 3B, for example, can no longer all be calculated to within 10% [Max. alpha = 1] (as opposed to approximately 70% probability for the worst-case “full” result).

Consider this against the changes in baserunner advancement, and for the most-affected situation: the most recent season.

Using the “full” results, on the other hand, means that the out, 1B, 2B, and 3B matrices are “delayed” by 0, 1, 4, and 9 seasons, respectively.

The same arguments as above (for the centered results) suggest that changes in baserunner advancement will be less important than the uncertainties introduced by using less data.

The only difference, in these cases, is that there can no longer be (centered) averaging away of changes in baserunner advancement.

This is only a minor argument, however; that follows the justification already made for the aforementioned balance.

Changes in baserunner advancement will be considered in more detail in a future article.

## Conclusions

A statistical analysis of the stochastic transition-matrices for the event-based framework for the Markov chain model of baseball was made.

In order to determine these matrix elements to within a reasonable specified distance with a high probability (specified above), data over 1, 3, 9, and 19 seasons was needed for the events out, 1B, 2B, and 3B, respectively.

The amount of data ensures that the probability that every transition is within % is %; and, on average, within % is %.

Considering the important balance between accuracy and (possible) changes in baserunner advancement suggests that these choices are both reasonable and practical.

This also suggests how these matrices for more recent seasons should be calculated. In particular, Eq. (3) [as opposed to (2)] should be preferred.

These results provide a foundation for, and insight into, the event-based framework for the Markov chain model of baseball.

They also provide important information for further consideration of baserunner advancement.

## References

[1] S. K. Thompson, “Sample Size for Estimating Multinomial Proportions,” *The American Statistician* **41**, 42–46 (1987)

[2] D. A. D’Esopo and B. Lefkowitz, “The Distribution of Runs in the Game of Baseball,” SRI Internal Report (1960)

[3] D. Beaudoin, “Various applications to a more realistic baseball simulator,” *Journal of Quantitative Analysis in Sports* **9**, 271–283 (2013)
