Statistical Analysis of the Stochastic Markov Matrices


An event-based framework for the Markov chain model of baseball has been detailed in a previous article.

This is based on a decomposition of the transition matrix of the Markov chain in terms of event (transition-) matrices that describe baseball.

In this work, a statistical analysis of the stochastic event-transition matrices is made.

The central consideration is a balance between accuracy and (possible) changes in baserunner advancement from season to season.

This analysis provides a lower bound on the probability that the transitions from any baseball state to all others are simultaneously within specified distances.

It also highlights additional considerations that must be made for event-based matrices, compared to the (total) transition matrix of the (standard) Markov model.

These results provide a foundation for, and insight into, the event-based framework for the Markov chain model of baseball.

To cite this Article:

statshacker, “An Event-Based Framework for the Markov Chain Model of Baseball,” statshacker [http://statshacker.com/an-event-based-framework-for-the-markov-chain-model-of-baseball] Accessed: YYYY-MM-DD

statshacker, “Statistical Analysis of the Stochastic Markov Matrices,” statshacker [http://statshacker.com/statistical-analysis-of-the-stochastic-markov-matrices] Accessed: YYYY-MM-DD

Introduction

The game of baseball can be described, with remarkable accuracy, by certain probability models.

The Markov chain model of baseball is perhaps the most powerful and elegant of these.

An event-based framework for this model has been described in detail in a previous article.

This is based on a decomposition of the transition matrix of the Markov chain in terms of event (transition-) matrices that describe baseball.

In this Communication, a statistical analysis of the stochastic Markov (event transition-) matrices for this framework is performed. In particular, the amount of data necessary to calculate the matrix elements is determined, under consideration of (possible) changes in baserunner advancement from season to season.

This Communication is outlined as follows. The methods are first discussed. Results are then reported. A discussion follows. Finally, conclusions are made.

This Communication is part of a series of articles exploring the Markov chain model of baseball, and its utility.

Note that the theoretical approach discussed here is implemented in the quantitative computational package _statshacker by statshacker.

Methods

This section describes application of the data mining process (detailed here) to this problem.

Data Understanding

Play-by-play data was obtained from Retrosheet.

Data Preparation

Data was processed and stored in a relational database using the relational database management system PostgreSQL.

The following data preparation used DB++ as an interface to PostgreSQL, and bbDBi as an interface to the baseball database.

Event data was calculated as detailed in a previous article.

Note that no filtering by league or game type (regular vs. postseason, or any other filtering) was applied.
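As an illustration of this step, the following minimal sketch pulls per-season transition counts from which the event matrices can be built. It uses psycopg2 directly (the actual pipeline uses DB++ and bbDBi), and the `events` table and its columns are hypothetical:

```python
# Sketch: per-season transition counts from the Retrosheet-derived database.
# The `events` table and columns (season, event_type, start_state, end_state)
# are hypothetical; the article's pipeline uses DB++ and bbDBi instead.
import psycopg2

conn = psycopg2.connect(dbname="retrosheet")
with conn.cursor() as cur:
    cur.execute("""
        SELECT season, event_type, start_state, end_state, COUNT(*)
        FROM events
        GROUP BY season, event_type, start_state, end_state
    """)
    rows = cur.fetchall()  # raw counts for building the single-season P_i^j matrices
conn.close()
```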

Event data was calculated over up to nine seasons. (Odd numbers of seasons are needed — see below; and considering data spanning more than a decade seems excessive.) Calculations were centered (see below) at 2013, unless otherwise specified. [This allows use of the maximum amount of data, up to the last complete season (2017), as of this writing.]

Modeling

Stochastic event transition-matrices have been discussed in a previous article.

In the following, denote this matrix \textbf{P} for event i at season j by \textbf{P}_i^j.

This matrix, at season n, is calculated using an N-season average,

(1)   \begin{equation*} \textbf{P}_i^n = \frac{1}{N} \sum_{j = n - (N-1)/2}^{n + (N-1)/2} \textbf{P}_i^j ~~~ , \end{equation*}

centered (see also below) at n. (Hence the necessity of an odd number of seasons.)

Realize that Eq. (1) averages single-season matrices, as opposed to totaling all events over the same time period. This gives weight (equal, in this case — see below) to seasons, rather than to events. This should provide a better “balance”, since the number of events almost certainly differs from season to season, as may baserunner advancement.

Note also that Eq. (1) implicitly weights equally the seasons including and surrounding the one of current interest. The possibility of a weighted average will be considered in a future article.

Equation (1) can be evaluated directly when \operatorname{min}(n + (N-1)/2, \text{cur.}) = n + (N-1)/2, where \text{cur.} is the current season.

When \operatorname{min}(n + (N-1)/2, \text{cur.}) = \text{cur.} and n + (N-1)/2 \neq \text{cur.}, however, further consideration is required.

Two straightforward approaches to such situations use “one-sided” averages, as follows:

First: One-sided “partial” results may be calculated as

(2)   \begin{equation*} \textbf{P}_i^n = \frac{1}{\text{cur.} - \left[ n - (N-1)/2 \right] + 1} \sum_{j = n - (N-1)/2}^{\text{cur.}} \textbf{P}_i^j ~~~ . \end{equation*}

The lower part of the range is kept fixed, and the (total) range is truncated at the upper end.

Second: One-sided “full” results may be calculated as

(3)   \begin{equation*} \textbf{P}_i^n = \frac{1}{N} \sum_{j = \text{cur.} - N + 1}^{\text{cur.}} \textbf{P}_i^j ~~~ . \end{equation*}

The lower part of the range is extended to compensate for the reduced range that would result from the “partial” approach (see above).

The first approach considers changes in baserunner advancement more important than the amount of data used to calculate \textbf{P}_i^n; and vice versa for the second. Note that the amount of data is related to the accuracy to which these matrices can be calculated. Consideration of the balance between these two is of central importance herein.

Note that centered data could also continue to be used. This, however, would require truncating the lower part of the range to match the truncation at the upper part. This would weight changes in baserunner advancement even more heavily relative to the amount of data. To study this balance, though, the above two possibilities are sufficient.
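To make the three averaging schemes concrete, here is a minimal sketch (Python/NumPy; the season-indexed dictionary of single-season matrices, and all names, are illustrative assumptions):

```python
# Minimal sketch of Eqs. (1)-(3). `matrices` maps season -> 25x25 stochastic
# event matrix P_i^j for one event i; all names are illustrative assumptions.
import numpy as np

def averaged_matrix(matrices, n, N, cur, one_sided="partial"):
    """N-season average (N odd) of single-season matrices, per Eqs. (1)-(3)."""
    h = (N - 1) // 2                          # seasons on each side of n
    if n + h <= cur:                          # Eq. (1): centered average
        seasons = range(n - h, n + h + 1)
    elif one_sided == "partial":              # Eq. (2): truncate at the current season
        seasons = range(n - h, cur + 1)
    else:                                     # Eq. (3): "full"; extend the lower end
        seasons = range(cur - N + 1, cur + 1)
    return np.mean([matrices[j] for j in seasons], axis=0)
```

Note that the denominator of Eq. (2) is just the (truncated) number of seasons, which np.mean handles automatically.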

These ideas form the basis for further discussion.

Evaluation

The transition from any baseball state (described here) (row i of a stochastic event transition-matrix) to all others (columns j) can be considered as a specification of multinomial proportions. The problem then becomes one of sample-size determination. Following Ref. [1], the objective can be stated as: select the smallest sample size for a random sample from a multinomial population such that, with probability at least 1 - \alpha (where \alpha is the significance level), all of the estimated proportions are simultaneously within specified distances d of the true population proportions.
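For illustration, the following sketch inverts the sample-size bound of Ref. [1]: given the number of observed events n behind a row and a distance d, it returns the (worst-case) significance level \alpha attained. This is a minimal reading of that calculation, not the article's actual code; the function name and the default of k = 25 columns are assumptions.

```python
# Sketch: worst-case significance level alpha attained by n observations of
# one matrix row (k columns), such that all estimated proportions are
# simultaneously within d of the true ones with probability >= 1 - alpha.
# Inverts the sample-size bound of Ref. [1]; illustrative only.
from scipy.stats import norm

def thompson_alpha(n, d, k=25):
    alpha = 0.0
    for m in range(2, k + 1):             # worst case: m equiprobable categories
        z = d * m * (n / (m - 1)) ** 0.5  # from n = z^2 (m - 1) / (m^2 d^2)
        alpha = max(alpha, 2 * m * norm.sf(z))
    return min(alpha, 1.0)
```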

Evaluation is carried out by calculating values of the average \overline{\alpha} [over all 24 (stochastic) states, not including the three-out states, which are known exactly] and the maximum \operatorname{max}(\alpha) for each matrix, for three values of d = 0.15, 0.1, and 0.05; the latter two are used for evaluation (see below), while the former is provided for reference. For the selected number of seasons, results for d = 0.02 and 0.01 are also reported.

Note that the following results are calculated using the total number of events summed over all matrices in Eq. (1). Since this total is expected to be similar (though almost certainly not identical) among seasons [not considering disruptions such as the 1994–95 Major League Baseball (MLB) strike], this approximation should work well.

The following criterion was used to select the number of seasons to use:

Select the minimum number of seasons such that \operatorname{max}(\alpha) < 0.01 to within d = 0.1, and \overline{\alpha} < 0.05 to within d = 0.05.

Selecting the minimum number of seasons accounts for changes in baserunning from season to season. The following conditions on \alpha as a function of d ensure that the probability that every transition is within 10% is \ge 99%; and, on average, within 5% is \ge 95%.

While the distances may seem large, it turns out that, despite the amount of data available, there is a practical limitation. It will be argued below that this criterion provides a reasonable balance between (possible) changes in baserunner advancement and accuracy.
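Taken together with the sketch above, the selection reduces to a short loop; `row_counts(N)` is a hypothetical helper returning the number of observed events behind each of the 24 transient rows when N seasons of data are pooled:

```python
# Sketch: smallest (odd) number of seasons satisfying the criterion above,
# using thompson_alpha() from the previous sketch.
def select_seasons(row_counts, max_N=27):
    for N in range(1, max_N + 1, 2):  # odd numbers of seasons only
        a10 = [thompson_alpha(n, 0.10) for n in row_counts(N)]
        a05 = [thompson_alpha(n, 0.05) for n in row_counts(N)]
        if max(a10) < 0.01 and sum(a05) / len(a05) < 0.05:
            return N
    raise ValueError("criterion not satisfied up to max_N seasons")
```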

Results

Stochastic event transition-matrices were evaluated statistically.

Prior to any analysis, it is clear that significance levels will decrease (i.e., a higher probability — see above) with increasing frequency of the event (\text{3B} \rightarrow \text{2B} \rightarrow \text{1B} \rightarrow \text{out}).

The (relative) number of seasons expected [1] to be necessary to achieve similar accuracy for each event can therefore be estimated from the probabilities of the events. Note that this is only an approximation, since the probabilities of events may depend on the baseball state, etc.

For 2013, for example, the probabilities are 48.4%, 15.4%, 4.51%, and 0.375% for \text{out}, \text{1B}, \text{2B}, and \text{3B}, respectively. It is therefore expected that 1{\times}, 3{\times}, 11{\times}, and 129{\times} the amount of data is necessary to achieve similar accuracy for each event.
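These multiples are simply the ratios of the \text{out} probability to each of the event probabilities:

\begin{equation*} \frac{48.4}{48.4} = 1 ~ , \quad \frac{48.4}{15.4} \approx 3 ~ , \quad \frac{48.4}{4.51} \approx 11 ~ , \quad \frac{48.4}{0.375} \approx 129 ~~~ . \end{equation*}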

Such estimations will be supported below, with additional remarks for \text{3B}.

Note that, in the following tables, *\#* (where \# is the number of seasons) marks the (centered) results selected according to the above criterion; and *\# marks the corresponding one-sided “partial” results.

Out Matrix

Significance levels for the \text{out} matrix are reported in the following tables:

d = 0.15:

No. Seasons | Avg. alpha | Max. alpha
1 | 1.46e-06 | 3.33e-05
3 | 0 | 0
5 | 0 | 0
7 | 0 | 0
9 | 0 | 0

d = 0.1:

No. Seasons | Avg. alpha | Max. alpha
*1* | 4.49e-04 | 8.20e-03
3 | 6.94e-09 | 1.54e-07
5 | 0 | 0
7 | 0 | 0
9 | 0 | 0

d = 0.05:

No. Seasons | Avg. alpha | Max. alpha
*1* | 0.0390 | 0.390
3 | 1.07e-03 | 0.0144
5 | 7.01e-05 | 1.22e-03
7 | 4.63e-06 | 9.28e-05
9 | 3.54e-07 | 7.81e-06

The matrix (all elements) is calculable to within the above criterion with only 1 season.

1B Matrix

Significance levels for the \text{1B} matrix are reported in the following tables:

d = 0.15:

No. Seasons | Avg. alpha | Max. alpha
1 | 8.06e-06 | 0.0125
2 | 5.76e-06 | 8.99e-05
3 | 8.73e-08 | 1.90e-06
5 | 0 | 0
7 | 0 | 0
9 | 0 | 0

d = 0.1:

No. Seasons | Avg. alpha | Max. alpha
1 | 0.0148 | 0.160
*2 | 1.16e-03 | 0.0130
*3* | 1.36e-04 | 2.17e-03
5 | 3.46e-06 | 5.93e-05
7 | 1.09e-07 | 2.14e-06
9 | 2.59e-09 | 5.72e-08

d = 0.05:

No. Seasons | Avg. alpha | Max. alpha
1 | 0.225 | 1
*2 | 0.0707 | 0.465
*3* | 0.0292 | 0.249
5 | 7.94e-03 | 0.0803
7 | 2.50e-03 | 0.0294
9 | 8.02e-04 | 0.0110

Also shown are 2-season results (2012 and 2013), for consideration below.

The matrix is calculable with 3 seasons.

Note that this is consistent with the above estimate.

2B Matrix

Significance levels for the \text{2B} matrix are reported in the following tables:

d = 0.15:

No. Seasons | Avg. alpha | Max. alpha
1 | 0.0585 | 0.567
3 | 3.14e-03 | 0.0461
5 | 2.74e-04 | 5.14e-03
7 | 2.47e-05 | 5.01e-04
9 | 2.20e-06 | 4.86e-05

d = 0.1:

No. Seasons | Avg. alpha | Max. alpha
1 | 0.208 | 1
3 | 0.0307 | 0.319
*5 | 7.39e-03 | 0.0990
7 | 2.02e-03 | 0.0293
*9* | 5.84e-04 | 9.78e-03

d = 0.05:

No. Seasons | Avg. alpha | Max. alpha
1 | 0.630 | 1
3 | 0.270 | 1
*5 | 0.152 | 1
7 | 0.0823 | 0.635
*9* | 0.0471 | 0.417

The matrix is seen to require 9 seasons.

Note that this is also consistent with the above estimate, considering that the estimate was made using event probabilities from 2013 only (and these change from season to season), along with the other approximation(s) noted above.

3B Matrix

Significance levels for the \text{3B} matrix are reported in the following tables:

d = 0.15:

No. Seasons | Avg. alpha | Max. alpha | No. Replaced
1 | 0.646 | 1 | 11
3 | 0.272 | 1 | 4
5 | 0.196 | 1 | 4
7 | 0.0926 | 0.613 | 0
9 | 0.060 | 0.450 | 0
17 | 8.52e-03 | 0.0874 | 0
19 | 5.13e-03 | 0.0411 | 0
27 | 1.45e-03 | 0.0125 | 0

d = 0.1:

No. Seasons | Avg. alpha | Max. alpha | No. Replaced
1 | 0.814 | 1 | 16
3 | 0.532 | 1 | 6
5 | 0.340 | 1 | 4
7 | 0.255 | 1 | 4
*9 | 0.210 | 1 | 3
17 | 0.0611 | 0.453 | 0
*19* | 0.0461 | 0.301 | 0
27 | 0.0218 | 0.160 | 0

d = 0.05:

No. Seasons | Avg. alpha | Max. alpha | No. Replaced
1 | 0.960 | 1 | 22
3 | 0.845 | 1 | 18
5 | 0.745 | 1 | 15
7 | 0.685 | 1 | 14
9 | 0.617 | 1 | 9
17 | 0.374 | 1 | 4
19 | 0.335 | 1 | 4
27 | 0.248 | 1 | 3

17-, 19-, and 27-season results are also shown, calculated for 2009, 2008, and 2004, respectively.

Note that the 27-season results are slightly affected by the 1994–95 MLB strike.

Also shown is the number of rows that are replaced by a model for baserunner advancement (e.g., Ref. [2]); selected here according to \alpha > 0.95 to within d = 0.1 (a very high significance level, but a reasonable choice — see the following discussion). Note that the importance of such replacements has been discussed in a previous article here.

As suspected, the probabilities to which this matrix may be calculated are much lower than those for the other events. Even at 27 seasons, the above criterion is not satisfied. And, based on the above estimate, the approximately 129 seasons needed would be impractical (e.g., changes in baserunner advancement could then no longer be neglected).

An examination of the matrix elements, however, shows that they resemble those for model baserunning [2] quite closely; for example, for 2008 and calculated from 19 seasons’ worth of data, the 1s in the \textbf{P}_\text{3B} matrix [see Eq. (6) in a previous article] are lower on average by only 1.95% (not including the three-out states, and with no replaced values — d = 0.1, in the above table). Such slight deviations occur, for example, when the batter-runner tries to stretch the triple by running home.
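The comparison just described can be sketched as follows; the state indexing and the `model_target()` helper (returning the column to which deterministic model baserunning [2] maps a given row) are hypothetical:

```python
# Sketch: mean shortfall from 1 of the stochastic 3B matrix elements at the
# columns predicted by model baserunning [2] (all runners score; the
# batter-runner holds at third). Transient rows only (three-out states excluded).
import numpy as np

def mean_deviation_from_model(P_3B, model_target):
    devs = [1.0 - P_3B[r, model_target(r)] for r in range(24)]
    return float(np.mean(devs))
```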

Based on this prior knowledge, it is reasonable to assume that the matrix elements are indeed within d = 0.1 (at least, on average), even though the actual calculation [1] (reported in the above tables) supports only weaker probability statements (not within the above criterion).

Therefore, the following more relaxed criterion is used for this event:

Select the minimum number of seasons such that \overline{\alpha} < 0.05 to within d = 0.1.

With this criterion, the matrix requires “only” 19 seasons’ worth of data; for centered data, this is 9 seasons on each side of the season of interest. While significant, this calculation becomes practical.

For the sake of discussion, disregard prior knowledge (see above). With the above criterion, \operatorname{max}(\alpha) = 0.301; therefore, the probability that all matrix elements are within 10% is at least 69.9%.

And, according to the above replacement criterion, the stochastic \text{3B} matrix can be used directly (i.e., without replacement) for calculations.

Combined Results

Significance levels for centered results selected according to the above criterion are repeated (for convenient reference) in the following tables:

d = 0.1:

Event | Avg. alpha | Max. alpha
out | 4.72e-04 | 8.56e-03
1B | 1.36e-04 | 2.17e-03
2B | 5.84e-04 | 9.78e-03
3B | 0.0461 | 0.301

d = 0.05:

Event | Avg. alpha | Max. alpha
out | 0.040 | 0.396
1B | 0.0292 | 0.249
2B | 0.0471 | 0.417
3B | 0.335 | 1

d = 0.02:

Event | Avg. alpha | Max. alpha
out | 0.435 | 1
1B | 0.457 | 1
2B | 0.505 | 1
3B | 0.840 | 1

d = 0.01:

Event | Avg. alpha | Max. alpha
out | 0.787 | 1
1B | 0.797 | 1
2B | 0.815 | 1
3B | 0.973 | 1

One-Sided “Partial” Results

Significance levels for one-sided “partial” results [see Eq. (2) and the surrounding discussion] are repeated in the following tables:

d = 0.1:

Event | Avg. alpha | Max. alpha
out | 4.72e-04 | 8.56e-03
1B | 1.16e-03 | 0.0130
2B | 7.39e-03 | 0.0990
3B | 0.210 | 1

d = 0.05:

Event | Avg. alpha | Max. alpha
out | 0.040 | 0.396
1B | 0.0707 | 0.465
2B | 0.152 | 1
3B | 0.617 | 1

Markov (Expected) Number of Runs

The Markov (expected) number of runs in a half inning, calculated and compared to the “actual” results, is shown in the following table:

Season | Actual | Markov | Difference (%)
2017 | 0.529 | 0.515 (1.044) | 2.7
2016 | 0.508 | 0.502 (1.027) | 1.2
2015 | 0.491 | 0.484 (1.006) | 1.5
2014 | 0.466 | 0.454 (0.973) | 2.5
2013 | 0.483 | 0.478 (1.002) | 0.9
2012 | 0.498 | 0.490 (1.015) | 1.4
2011 | 0.499 | 0.493 (1.020) | 1.1
2010 | 0.499 | 0.498 (1.029) | 0.2

These results were calculated as discussed in a previous article.
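For reference, a minimal sketch of such a calculation is given below, assuming a 25-state transition matrix T for the half inning (three-out states absorbing, scoring no runs) and a matrix R of runs scored on each transition; both names and the setup are my assumptions about the previous article:

```python
# Sketch: expected runs in a half inning, propagating the state distribution
# one plate appearance at a time and accumulating expected runs per step.
import numpy as np

def expected_runs(T, R, start=0, max_steps=50):
    p = np.zeros(T.shape[0])
    p[start] = 1.0                       # start: no outs, bases empty
    runs = 0.0
    for _ in range(max_steps):           # absorption is effectively complete by then
        runs += p @ (T * R).sum(axis=1)  # E[runs scored on this transition]
        p = p @ T                        # advance one plate appearance
    return runs
```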

Comparing these two sets of results reveals insight into the more “converged” data presented herein.

The Markov number of runs agrees very well with the actual results, for all seasons. It is slightly underestimated in all cases (as expected), though on average only by 1.5(8)%.

Compared to previous results, the differences are the same on average. The lower standard deviation (0.8%, compared to 0.9%), though, shows that the results herein are more “stable”.

There is also a stronger correlation between the Markov and actual results: Pearson’s correlation coefficient is r = 0.974 (compared to 0.969). This also resolves the subtle difference between the 2011 and 2012 results, where the former actually has a slightly higher (by 0.001) number of runs; this is predicted correctly in the above table.

Discussion

A statistical analysis of the stochastic transition-matrices for the event-based framework for the Markov chain model of baseball was made.

In order to determine these matrix elements to within a reasonable specified distance with a high probability (specified above), data over 1, 3, 9, and 19 seasons was needed for the events \text{out}, \text{1B}, \text{2B}, and \text{3B}, respectively.

This is a significant amount of data, especially for the less-frequent events (\text{2B} and \text{3B}).

Given the amount of data necessary to achieve significant results, caution should be exercised in the use of data obtained over much shorter time periods (e.g., Ref. [3]).

While the amount of data is not a problem in itself, the accuracy achieved by using more of it must be balanced against (possible) changes in runner advancement.

Careful thought, however, suggests a (fortunate) coincidence regarding this balance.

In particular, there must be less possible change in runner advancement with decreasing frequency of event.

Consider the least-frequent event — \text{3B}. Arguments analogous to those made in relation to this matrix (above) apply here. In this context, there are only slight changes in the outcomes (transition probabilities) that one could envision for this event (e.g., perhaps batter-runners become faster and, as a result, a fraction of a percent more try to stretch the triple home).

Similar arguments could be made for the other events.

Coincidentally then, while less-frequent events require more data (raising the balance issue above), they are also less able to change over time.

Therefore, the uncertainties introduced by using less data should outweigh changes in baserunner advancement; indeed, the former are quantifiably large — see above.

In passing, note that this highlights an important advantage of this framework; the separation of the probability of event (\text{out}, \text{1B}, \text{2B}, and \text{3B}, under discussion) occurring from its (stochastic, in this case) outcome (baserunner advancement). While the former may change (significantly) from season to season, the latter may not.

Consider the implication of these results for the (standard) Markov chain model of baseball.

The above results suggest that calculating the transition matrix elements for this model (e.g., over only one season) can’t resolve the underlying effects that make them up.

For this model, more error would be introduced by the less-frequent events. Furthermore, it is these which have the most impact on an event-by-event basis (e.g., the result of a triple, compared to a single).

However, there is also a rapid decrease in the probabilities of these events (see above).

Therefore, the error on the total transition matrix [Eq. (2) in a previous article] would probably be minimized; meaning that this issue is probably not a significant concern.

Resolving these effects, however, is extremely important for the event-based framework. This is because there is a relatively small spread in talent in the MLB; and so accurately determining the impact of such small differences is important.

Return to the above discussion.

Even if there are changes in baserunner advancement, the centering in Eq. (1) should average these away (to some extent). Only if there were a parabolic change in this, such that at year n there was a maximum or minimum, would this not be the case.
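To see this concretely, suppose (hypothetically) a linear drift \textbf{P}_i^j = \textbf{P}_i + (j - n) \textbf{B} for some constant per-season change \textbf{B}. The centered average of Eq. (1) then recovers \textbf{P}_i exactly,

\begin{equation*} \frac{1}{N} \sum_{j = n - (N-1)/2}^{n + (N-1)/2} \left[ \textbf{P}_i + (j - n) \textbf{B} \right] = \textbf{P}_i ~~~ , \end{equation*}

since the (j - n) terms cancel in pairs; a quadratic (parabolic) term, by contrast, would not cancel.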

An important related consideration, though, is whether to use one-sided “partial” or “full” results for more recent seasons.

Note that it is these that are of interest for current and future (predictive) analysis.

This remains fundamentally based on the same important balance as above.

The one-sided “partial” results show a significant decrease in probability for less-frequent events. Matrix elements for \text{2B}, for example, can no longer all be calculated to within d = 0.05 [\operatorname{max}(\alpha) = 1] (as opposed to approximately 60% probability for the worst-case “full” result).

Consider this against the changes in baserunner advancement, for the most-affected situation — the most recent season.

Using the “full” results, on the other hand, means that the \text{out}, \text{1B}, \text{2B}, and \text{3B} matrices are “delayed” by 0, 1, 4, and 9 seasons, respectively.

The same arguments as above (for the centered results) suggest that changes in baserunner advancement will be less important than the uncertainties introduced by using less data.

The only difference, in these cases, is that there can no longer be (centered) averaging away of changes in baserunner advancement.

This is only a minor argument, however, and it follows the justification already made for the aforementioned balance.

Changes in baserunner advancement will be considered in more detail in a future article.

Conclusions

A statistical analysis of the stochastic transition-matrices for the event-based framework for the Markov chain model of baseball was made.

In order to determine these matrix elements to within a reasonable specified distance with a high probability (specified above), data over 1, 3, 9, and 19 seasons was needed for the events \text{out}, \text{1B}, \text{2B}, and \text{3B}, respectively.

The amount of data ensures that the probability that every transition is within 10% is \ge 99%; and, on average, within 5% is \ge 95%.

Considering the important balance between accuracy and (possible) changes in baserunner advancement suggests that these choices are both reasonable and practical.

This also suggests how these matrices for more recent seasons should be calculated. In particular, Eq. (3) [as opposed to Eq. (2)] should be preferred.

These results provide a foundation for, and insight into, the event-based framework for the Markov chain model of baseball.

They also provide important information for further consideration of baserunner advancement.

References

[1] S. K. Thompson, “Sample Size for Estimating Multinomial Proportions,” The American Statistician 41, 42–46 (1987)

[2] D. A. D’Esopo and B. Lefkowitz, “The Distribution of Runs in the Game of Baseball,” SRI Internal Report (1960)

[3] D. Beaudoin, “Various applications to a more realistic baseball simulator,” Journal of Quantitative Analysis in Sports 9, 271–283 (2013)


About Author

statshacker is an Assistant Professor of Physics and Astronomy at a well-known state university. His research interests involve the development and application of concepts and techniques from the emerging field of data science to study large data sets. Outside of academic research, he is particularly interested in such data sets that arise in sports and finance. Contact: statshacker@statshacker.com