State-Based Event Probabilities in Baseball

Baseball statistics provide information about the probabilities of events that constitute the game of baseball.

As they are collected over a finite period of time, they only provide this in an average sense.

At a finer level, one may be interested in how these probabilities depend on the baseball state.

This provides further information about the mechanics of baseball.

In this work, a method to accurately determine state-based event probabilities is developed.

As an example application, state-based event probabilities are then determined and discussed for 2014.

These results provide important quantitative information.

To cite this Article:

statshacker, “State-Based Event Probabilities in Baseball,” statshacker [http://statshacker.com/state-based-event-probabilities-in-baseball] Accessed: YYYY-MM-DD

Introduction

The discrete, well-defined and relatively “clean” structure of the game of baseball was discussed in a previous article, in terms of baseball states and transitions between them.

As discussed in relation to the event-based framework for the Markov chain model of baseball, this structure can be described in more detail in terms of the cause of these transitions: the events that constitute the game.

In order to understand the mechanics of baseball, the probabilities of these events at any time must be known.

A first-order approximation would be that these are given by (standard) baseball statistics.

In order to improve upon this, state-based event probabilities may be considered.

Qualitatively, some aspects of this are known: an intentional base on balls is more likely to occur when first base is not occupied, a sacrifice fly is more likely with a baserunner on third base and fewer than two outs, etc.

Quantitatively, however, the associated probabilities are not.

More subtle details are also unknown; e.g., minor adjustments, given a general state.

Quantitative information is important for (accurate) baseball analytics.

In this work, a method to accurately determine state-based event probabilities is developed.

This Article is outlined as follows. The methods used to calculate state-based event probabilities are first developed. Following this, a batting-event model for baseball is presented. A statistical analysis of the probability-factor matrix (the central object developed herein) for this model is then presented. State-based event probabilities (in terms of this matrix) are then reported for 2014. A discussion and conclusions follow.

Methods

This section describes application of the data mining process (detailed here) to this problem.

Data Understanding

Play-by-play data was obtained from Retrosheet.

Data Preparation

Data was processed and stored in a relational database using the relational database management system PostgreSQL.

The following data preparation used DB++ as an interface to PostgreSQL, and bbDBi as an interface to the baseball database.

Event data was calculated from this database. Additional details will be discussed below, in the context of the model considered herein.

Note that no filtering by league nor game type (regular vs. postseason, or any other filtering) was considered.

Event data was calculated over up to nine seasons. (An odd number of seasons is needed, as explained below, and considering more than a decade of data seems excessive.) Calculations were centered (see below) at 2013, unless otherwise specified. [This allows for use of the maximum amount of data, up to the last complete season (2017), as of this writing.]

Modeling

Sample Space

Denote the sample space for a (general) event-based model of baseball as S,

(1)   \begin{equation*} S = \{\text{e}_1, \text{e}_2, \ldots, \text{e}_{N_\text{e}}\} \end{equation*}

where \text{e}_j denotes event (\text{e}) j = 1, 2, \ldots, N_\text{e} (N_\text{e} is the total number of events). Note the non-zero offset of these events (i.e., events are indexed from 1).

Note that j (alone) will be used below to denote the associated event.

Probabilities

Consider the events of S as outcomes of a random phenomenon.

A discrete random variable X can then be defined on S. The probability mass function (pmf) of X is defined as

    \begin{equation*} p_X(j) = \operatorname{Pr}(X = j) ~~~ , \end{equation*}

where \operatorname{Pr}(X = j) is the probability that X = j.

Note that the notation p here (as opposed to the more common one of f) is used, for clarity in notation below.

Note also that since there is only one random variable (of focus), the subscript X will be hereon omitted.

Since p(j) is a pmf,

(2)   \begin{align*} 0 \le p(j) \le 1 ~~~ \text{for all} ~ j \\ \sum_j p(j) = 1 ~~~ . \end{align*}

For reference below, it will be convenient to write the pmf as a probability vector \textbf{p},

    \[ \label{eq:p} \textbf{p}^\mathrm{T} = \begin{bmatrix} p(1) \\ p(2) \\ \vdots \\ p(N_\text{e}) \end{bmatrix} ~~~ . \]

For state-based event probabilities, there is a dependence on the state i.

In this case, p is a conditional probability mass function, denoted by

    \begin{equation*} p(j|i) = \operatorname{Pr}(X = j|Y = i) ~~~ , \end{equation*}

where Y is a discrete random variable with range \{0, 1, \ldots, 27\}.

Only for states i = 0–23, however, can X [and hence p(j|i)] be (meaningfully) defined in this way.

States i = 24–27 are deterministic; in other words, the events in S can no longer be considered as outcomes of a random phenomenon.

For future consideration (see below), it is convenient though to include these states with the same mathematical formulation as the others.

Since these states are deterministic, the specifications of p(j|i \ge 24) are unimportant; the only requirement is to maintain p as a valid pmf [Eq. (2)].

The most convenient choice (see below), and without loss of generality, is to take these elements to be p(j) [i.e., total (state-independent) probabilities].

The details of this will be further remarked upon in a future article. Herein, results for these states (which are not important in this context) will not be explicitly reported.

It is convenient to then assemble these probabilities into a probability matrix \textbf{Pr},

(3)   \begin{equation*} \textbf{Pr} = \begin{bmatrix} p(1|0)          & p(2|0)          & \ldots  & p(N_\text{e}|0) \\ p(1|1)          & p(2|1)           & \ldots  & p(N_\text{e}|1) \\ \vdots          & \vdots           & \ddots & \vdots \\ p(1|23)        & p(2|23)         & \ldots  & p(N_\text{e}|23) \\ \textbf{p}(1) & \textbf{p}(2) & \ldots  & \textbf{p}(N_\text{e}) \end{bmatrix} \end{equation*}

where \textbf{p}(j) are 4{\times}1 blocks of p(j).

Data Amount/Change Dilemma

Given data, the matrix in Eq. (3) could be calculated directly. Yet, while this would indeed precisely describe the data, many elements would be subject to large errors.

Consider first the amount of data, for example for the entire 2013 season and the batting events considered below. State (\varnothing,0) [in (B,o) notation; see here] occurred 45,589 times, whereas state (3,0) occurred only 461 times (nearly 100{\times} fewer). The low amount of data (for particular states) leads to large uncertainties in the simultaneous estimation of probabilities (for those states).
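The effect of these counts can be illustrated with a rough sketch. It assumes, for illustration only, a single event with a hypothetical probability of 0.2 and the simple binomial standard error \sqrt{p(1-p)/n}; this is not the simultaneous multinomial analysis used later in this Article, and only the two counts are taken from the text.

```python
import math


def binomial_standard_error(p, n):
    """Standard error of a proportion p estimated from n observations."""
    return math.sqrt(p * (1.0 - p) / n)


# Counts quoted in the text for the 2013 season.
n_empty_0 = 45589  # bases empty, 0 outs
n_third_0 = 461    # runner on third, 0 outs

# Hypothetical event probability, chosen only for illustration.
p_example = 0.2

se_common = binomial_standard_error(p_example, n_empty_0)
se_rare = binomial_standard_error(p_example, n_third_0)
ratio = se_rare / se_common  # equals sqrt(n_empty_0 / n_third_0)

print(f"standard error, bases empty:     {se_common:.4f}")
print(f"standard error, runner on third: {se_rare:.4f}")
print(f"ratio: {ratio:.1f}")
```

The ratio depends only on the two counts: roughly 100 times fewer observations give roughly 10 times larger uncertainty, whatever value is assumed for the probability.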

Considering additional data would solve this problem. However, this would not account for changes in the data (probabilities) over time.

Considering these two aspects together represents the data amount/change dilemma. The problem is that for certain events (e.g., strikeouts, in recent seasons), the change in probability per season is larger than the accuracy to which all events can be simultaneously calculated.

Probability Factors

Instead of probabilities, consider probability factors f.

The probability factor f(j|i) is defined as the probability of event j when in state i, relative to the total probability,

(4)   \begin{align*} f(j|i) &= \frac{p(j|i)}{\sum_{i = 0}^{27} p_Y(i) p(j|i)} \\ &= \frac{p(j|i)}{p(j)} \end{align*}

where p_Y(i) is the pmf of Y. Note that the total probability in the second line has been obtained by marginalizing out the state information.

It is convenient to assemble these factors also into a probability-factor matrix \textbf{F},

(5)   \begin{equation*} \textbf{F} = \begin{bmatrix} f(1|0)          & f(2|0)         & \ldots  & f(N_\text{e}|0) \\ f(1|1)          & f(2|1)         & \ldots  & f(N_\text{e}|1) \\ \vdots         & \vdots        & \ddots & \vdots \\ f(1|23)        & f(2|23)       & \ldots  & f(N_\text{e}|23) \\ \mathbf{1} & \mathbf{1} & \ldots  & \mathbf{1} \\ \end{bmatrix} \end{equation*}

where \mathbf{1} are 4{\times}1 blocks of 1s.

Notice now the “convenience” of taking the elements to be p(j) for states i = 24–27 (see above).
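The construction of Eqs. (4) and (5) can be sketched as follows. This is a minimal illustration, not the author's implementation: the function name, the toy counts, and the convention of returning a factor of 0 for an event that never occurs are all assumptions, and every stochastic state is assumed to have at least one observation.

```python
def probability_factors(counts, n_deterministic=4):
    """Build F [Eq. (5)] from counts[i][j], the number of times
    event j was observed in (stochastic) state i."""
    n_events = len(counts[0])
    total = sum(sum(row) for row in counts)

    # Total (state-independent) probabilities p(j): the marginal over states.
    p_total = [sum(row[j] for row in counts) / total for j in range(n_events)]

    factors = []
    for row in counts:
        n_state = sum(row)  # assumed > 0 for every stochastic state
        f_row = []
        for j in range(n_events):
            p_cond = row[j] / n_state  # p(j|i)
            # Assumed convention: factor 0 for an event never observed at all.
            f_row.append(p_cond / p_total[j] if p_total[j] > 0 else 0.0)
        factors.append(f_row)

    # Deterministic (three-out) states: p(j|i) = p(j), so every factor is 1.
    factors.extend([[1.0] * n_events for _ in range(n_deterministic)])
    return factors, p_total


# Toy example (hypothetical counts): 2 stochastic states x 2 events.
counts = [[80, 20],
          [10, 10]]
F, p_total = probability_factors(counts)
for row in F:
    print(row)
```

Weighting each column of the stochastic rows by the state probabilities p_Y(i) recovers 1, which is a quick consistency check on Eq. (4).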

Recognize the fundamental differences between probability factors and the probabilities which they are calculated from: whereas the latter certainly change over time, the former may not.

The following assumption of uniformity is therefore made:

Probability factors change only slowly over time.

This assumption implies that changes in probabilities are uniformly distributed over the states, up to some tolerance [which allows for (slow) changes, considered below].

This assumption seems reasonable; changes in probability factors would imply changes in the underlying mechanics of the game of baseball, which are expected to change slowly over time. This assumption will be verified below.

This assumption solves the second problem in the data amount/change dilemma (changes in the data).

Calculating Probability Factors

With the second problem in the data amount/change dilemma solved (see just above), the first (amount of data) can be addressed as well.

\textbf{F} is therefore calculated using data over several seasons. (Recall that a single season is insufficient for some states.)

In the following, denote this matrix at season k by \textbf{F}^k.

This matrix, at season n, is calculated using an N-season average,

(6)   \begin{equation*} \textbf{F}^n = \frac{1}{N} \sum_{k = n - N/2}^{n + N/2} \textbf{F}^k ~~~ , \end{equation*}

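centered (see also below) at n, where N/2 denotes integer division [i.e., (N - 1)/2 for odd N], so that the sum contains exactly N terms. (Hence the necessity mentioned above of an odd number of seasons.)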

Justification for an average of single-season matrices, as opposed to a total of all events over the same time period, has been given for runner advancement (see the discussion in relation to Eq. (1) in this article); analogous arguments (number of events per season, and possible changes in probability factors, as discussed above) apply in this case.

Equation (6) can be evaluated directly when \operatorname{min}(n + N/2, \text{cur.}) = n + N/2, where \text{cur.} is the current season.

When \operatorname{min}(n + N/2, \text{cur.}) = \text{cur.} and n + N/2 \neq \text{cur.}, further consideration has to be made.

Based on previous consideration (in the context of runner advancement; analogous arguments again apply in this case), the optimal approach should be one-sided “full” results, calculated as

    \begin{equation*} \textbf{F}^n = \frac{1}{N} \sum_{k = \text{cur.} - N + 1}^{\text{cur.}} \textbf{F}^k ~~~ , \end{equation*}

where the lower part of the range is extended to compensate for the reduced (total) range caused by truncation of the upper part. That is, this approach considers (possible) changes in probability factors less important than the amount of data used to calculate \textbf{F}.
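The choice between the centered window of Eq. (6) and the one-sided window can be sketched as follows. The function names are hypothetical, N/2 is read as integer division, and matrices are represented as plain lists of lists; none of this is taken from the author's code.

```python
def season_window(n, n_seasons, current):
    """Seasons averaged to obtain F at season n.

    Centered on n when the window fits before `current` [Eq. (6)];
    otherwise the last `n_seasons` seasons up to `current` (one-sided).
    """
    if n_seasons % 2 == 0:
        raise ValueError("an odd number of seasons is required")
    half = n_seasons // 2  # integer division: (N - 1) / 2 for odd N
    if n + half <= current:
        return list(range(n - half, n + half + 1))
    return list(range(current - n_seasons + 1, current + 1))


def average_matrices(matrices):
    """Element-wise mean of equally shaped matrices (lists of lists)."""
    count = len(matrices)
    n_rows = len(matrices[0])
    n_cols = len(matrices[0][0])
    return [[sum(m[i][j] for m in matrices) / count for j in range(n_cols)]
            for i in range(n_rows)]


print(season_window(2014, 7, 2017))  # centered window
print(season_window(2016, 7, 2017))  # one-sided window
```

With 2017 as the last complete season and N = 7, both calls return the seasons 2011 through 2017: 2014 is the last season for which the window is still centered, and for 2016 the lower end is extended instead.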

Consideration of the balance between these aspects is of central importance herein.

Extracting Probability Factors

In order to extract probability factors (for a state of interest), the following approach can be used.

Define the 1{\times}28 (definite) state vector that describes the baseball state (see here) i by \textbf{x}_i,

    \[ \textbf{x}_i^\mathrm{T} = \begin{bmatrix} 0 \\ \vdots \\ i = 1 \\ \vdots \\ 0 \end{bmatrix} \]

(i.e., all elements are 0, except for i which is 1).

The 1{\times}N_\text{e} probability-factor vector \textbf{f}_i is then

(7)   \begin{equation*} \textbf{f}_i = \textbf{x}_i \textbf{F} ~~~ ; \end{equation*}

i.e., the i^\text{th} row of \textbf{F}.

Probability Factors to Probabilities

\textbf{F} can be used to calculate state-based probabilities, given a probability vector \textbf{p} [the vector form of the pmf introduced above; i.e., defined over all states].

“Probabilities” (quotations used for reasons described below) \textbf{p}'_i for state i are first calculated as

(8)   \begin{equation*} \textbf{p}'_i = \textbf{p} \operatorname{diag}(\textbf{f}_i) \end{equation*}

where \operatorname{diag}(\textbf{z}) (\textbf{z} is a generic vector) is an operator that creates a matrix with \textbf{z} as its main diagonal.

In general,

(9)   \begin{equation*} \sum_j^{N_\text{e}} p'_i(j) = 1 + \epsilon \approx 1 \end{equation*}

where \epsilon is a small component, for any state i < 24.

Only for \textbf{p} calculated over the same events as for \textbf{F} (or by coincidence) would \epsilon = 0 and the approximation in Eq. (9) become an equality; as evident from the definition in Eq. (4).

Note that the deviation is small (an approximate equality, rather than an inequality of implied larger magnitude) because, in baseball, the spread in statistics (however considered: over seasons, leagues, teams, or players) is small.

Note further that this does not imply any approximation in the method itself; such will be considered below.

In order to consider \textbf{p}'_i as a probability vector, however, the equality in Eq. (9) must hold; i.e., the second condition of Eq. (2) must be satisfied (the first condition already is).

In the context of probability [in particular, the constraint that the probabilities must sum to 1, the second condition in Eq. (2)], what is important is relative magnitudes. In the context herein, what is important is therefore the relative magnitudes of probability factors.

Simple normalization imposes this constraint, and conserves relative magnitudes,

(10)   \begin{equation*} p_i(j) = \frac{p'_i(j)}{\sum_{j'} p'_i(j')} ~~~ ; \end{equation*}

i.e.,

    \begin{equation*} \frac{p_i(j)}{p_i(l)} = \frac{p'_i(j)}{p'_i(l)} \end{equation*}

for any two events j and l.
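Equations (8) and (10) can be sketched numerically as follows. The factors are the central values of the (23,1) row of the 2014 table reported later in this Article; the state-independent probabilities are hypothetical round numbers chosen only so that they sum to 1, and the function name is an assumption.

```python
EVENTS = ["K", "BB", "IBB", "HBP", "out", "1B", "2B", "3B", "HR", "SH", "SF"]

# Probability factors for state (23,1), 2014 (central values, no uncertainties).
f_state = [0.85, 1.11, 20.6, 1.6, 0.635, 0.97, 0.90, 1.2, 0.6, 0.4, 15.9]

# Hypothetical state-independent probabilities (NOT from the Article).
p_total = [0.20, 0.07, 0.005, 0.009, 0.477, 0.15, 0.045, 0.005, 0.025,
           0.007, 0.007]


def state_probabilities(p, f):
    """Apply Eq. (8), then normalize as in Eq. (10)."""
    # Multiplying by diag(f) is an element-wise product.
    p_prime = [p_j * f_j for p_j, f_j in zip(p, f)]
    norm = sum(p_prime)  # this is 1 + epsilon in Eq. (9)
    return [x / norm for x in p_prime], norm


p_state, norm = state_probabilities(p_total, f_state)
for name, value in zip(EVENTS, p_state):
    print(f"{name:>3}: {value:.4f}")
print(f"1 + epsilon = {norm:.4f}")
```

Because every element is divided by the same number, ratios between any two events are unchanged by the normalization, which is the property stated after Eq. (10).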

Application of Probability Factors (Probabilities)

Implicit in the application of Eqs. (8) and (10) is an assumption that should be made explicit.

Using \textbf{F} to calculate state-based probabilities given \textbf{p} (see above) assumes that the probability factors underlying \textbf{p} (had they actually been calculated) would be equivalent to those in \textbf{F}.

This is the only assumption (except that of uniformity — see above) in the above framework.

Considering again the spread in statistics (see above), this assumption seems reasonable.

Evaluation

The probabilities (not factors) of events from any baseball state (row i of \textbf{Pr}) can be considered as specification of multinomial proportions. The problem then becomes one of sample-size determination. Following Ref. [1], this objective can be stated as: select the smallest sample size for a random sample from a multinomial population such that the probability will be at least 1 - \alpha (where \alpha is the significance level) that all of the estimated proportions will simultaneously be within specified distances d of the true population proportions.

Calculations are carried out for distances d_j = d for all j = 1, \ldots, N_\text{e}, and no prior knowledge is assumed about the parameters [except for calculating uncertainties when p(j|i) = 0; see below].
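A worst-case calculation of this kind can be sketched as follows. This is one reading of the approach of Ref. [1] under a normal approximation, with a Bonferroni bound over m equally likely categories; whether it matches the calculation actually used here is an assumption, as are the function names and the cap of 11 categories (the number of events in the batting-event model below).

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def worst_case_alpha(n, d, max_categories=11):
    """Upper bound on the probability that at least one estimated
    proportion is farther than d from its true value, for sample size n."""
    worst = 0.0
    for m in range(2, max_categories + 1):
        # Standard error of each proportion when m categories share 1/m.
        sigma = math.sqrt((1.0 / m) * (1.0 - 1.0 / m) / n)
        tail = 1.0 - _STD_NORMAL.cdf(d / sigma)
        # Bonferroni bound: m categories, two tails each.
        worst = max(worst, 2.0 * m * tail)
    return min(worst, 1.0)


def smallest_sample_size(alpha, d, max_categories=11):
    """Smallest n for which the worst-case bound does not exceed alpha."""
    best = 0.0
    for m in range(2, max_categories + 1):
        z = _STD_NORMAL.inv_cdf(1.0 - alpha / (2.0 * m))
        best = max(best, z * z * (1.0 / m) * (1.0 - 1.0 / m) / (d * d))
    return math.ceil(best)


print(worst_case_alpha(461, 0.05))
print(smallest_sample_size(0.05, 0.05))
```

For n = 461 (the 2013 count quoted above for state (3,0)) and d = 0.05, the bound is about 0.068; for \alpha = 0.05 and d = 0.05, the smallest sample size from this sketch is 510.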

Evaluation

Evaluation is carried out by calculating values of the average \overline{\alpha} [over all 24 (stochastic) states] and the maximum \operatorname{max}(\alpha) for each \textbf{Pr} matrix, for three values of d (0.05, 0.02, and 0.01); these are sufficient for evaluation (see below).

Note that results are calculated using the total number of events summed over all matrices in Eq. (3). Since this total is expected to be similar (though, almost certainly different) among seasons [not considering things such as the 1994–95 Major League Baseball (MLB) strike], this approximation should work well.

Consider the aforementioned balance between accuracy and (possible) change in probability factors.

If the relative rate of change of probability factors to probabilities could be specified (it is unknown), then so could a criterion to achieve this balance. Assuming such a specification, the following criterion (optimally) does so:

Select the minimum number of seasons s such that \operatorname{max}(\alpha) < 0.05 to within d = d_\text{max}

where d_\text{max} is the maximum change in any probability factor.

The purpose of this criterion is to balance the change in probability factors against the accuracy to which they can be determined.

Herein, the following relative rate of change (which is unknown) is assumed:

The change in probability factors is an order of magnitude slower than that in the probabilities.

Uncertainty

The approach in Ref. [1] can also be used to calculate the uncertainty for probability factors.

This is done by first calculating uncertainties for the probabilities p(j|i) and p(j).

Propagation of uncertainty is then used to calculate the uncertainties of the factors, using the standard formula [2]. Note that, under the assumption of uniformity (see here), p(j|i) and p(j) are assumed to have a perfect positive correlation.

For p(j|i) = 0 exactly, prior knowledge is assumed (i.e., that this event cannot occur for this state), and 0 uncertainty is also reported; for p(j|i) that is 0 only to the reported precision, the calculated uncertainty is reported.
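The propagation step can be sketched with the usual first-order formula for a ratio of two correlated quantities. That this is the formula meant by Ref. [2] is an assumption; the function name and the numbers in the example are hypothetical.

```python
import math


def factor_uncertainty(p_cond, s_cond, p_tot, s_tot, rho=1.0):
    """First-order uncertainty of f = p_cond / p_tot.

    s_cond and s_tot are the uncertainties of p(j|i) and p(j);
    rho is their correlation coefficient (1 = perfect positive).
    """
    f = p_cond / p_tot
    # Partial derivatives: df/dp_cond = 1 / p_tot, df/dp_tot = -p_cond / p_tot**2.
    term_cond = s_cond / p_tot
    term_tot = p_cond * s_tot / p_tot ** 2
    variance = term_cond ** 2 + term_tot ** 2 - 2.0 * rho * term_cond * term_tot
    return f, math.sqrt(max(variance, 0.0))  # guard against rounding below 0


# Hypothetical inputs, for illustration only.
f, s_corr = factor_uncertainty(0.2, 0.02, 0.25, 0.01, rho=1.0)
_, s_indep = factor_uncertainty(0.2, 0.02, 0.25, 0.01, rho=0.0)
print(f, s_corr, s_indep)
```

With perfect positive correlation the two contributions partially cancel, so the uncertainty of the factor is smaller than it would be for independent quantities.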

Example Application: Batting-Event Model

In a previous article, a (simplified) batting-event model for baseball was developed.

This model was “simplified”, in the sense that the sample space was incomplete.

Strategies

Particular batting events omitted (as events in the sample space) were those that are state dependent. These were discussed in the context of strategies.

With state-based probabilities, it is possible to expand this model to account (at least partially — see below) for and consider these events (or effects — see below).

These events include the following:

  • \text{IBB}
  • \text{SH}
  • \text{SF}

where \text{IBB} is intentional base on balls, and \text{SH} and \text{SF} are sacrifice hits and flies, respectively.

Note that, following Retrosheet, \text{SH} and \text{SF} are (technically) not considered events; in the framework herein, however, they are.

\text{IBB}

Including \text{IBB} partially accounts for this defensive strategy.

Accounted for (herein) is the state dependence of this event (this will be discussed below).

Not directly (see below) accounted for, however, is that they also occur most frequently with an excellent batter at the plate and a significantly worse one on deck.

This effect though is (partially — see the following assumption) indirectly accounted for, on average, through the statistic \text{IBB}. Assume (for discussion) that the probability of the baseball state is the same for all batters. If a particular batter then has a high \text{IBB}, for example, then they in general (i.e., on average) are expected to occur in a position in a lineup where they are considered an excellent batter, followed by a significantly worse one.

\text{SH} and \text{SF}

Including \text{SH} and \text{SF} events accounts for this offensive strategy (sacrifice plays).

Accounted for is the state dependence of these events.

This is sufficient for \text{SF} and the baserunner advancement aspect of \text{SH} (see the following comment).

Not accountable for in this framework are squeeze plays in \text{SH}. Based on consideration of probabilities alone, this is not possible to consider. These generally occur in late innings of a close game.

Note that squeeze plays are included in the following calculations; with these caveats in mind.

Sample Space

The sample space for this model is as follows:

(11)   \begin{equation*} S = \{\text{K}, \text{BB}, \text{IBB}, \text{HBP}, \text{out}, \text{1B}, \text{2B}, \text{3B}, \text{HR}, \text{SH}, \text{SF}\} \end{equation*}

where events not defined above have been defined previously.

Statistical Analysis

Probability-factor matrices [Eq. (5)] were evaluated statistically.

Note that, in the following tables, *\#* where \# is the number of seasons is used to mark (centered) results selected according to the above criterion.

Maximum Probability Change

Recall that the maximum probability change is needed to estimate the change in probability factors.

The event with maximum probability change is taken to be strikeout rate.

The change in strikeout rate \Delta \text{K} for Major League Baseball over a number of seasons centered at 2013 is shown in the following table:

| No. Seasons | ΔK |
| --- | --- |
| 1 | 0 |
| 3 | 5.79e-03 |
| 5 | 0.0177 |
| 7 | 0.0263 |
| 9 | 0.0369 |

Significance Levels

Significance levels are reported in the following tables:

d = 0.05:

| No. Seasons | Avg. alpha | Max. alpha |
| --- | --- | --- |
| 1 | 5.40e-03 | 0.0683 |
| 3 | 1.12e-05 | 2.08e-04 |
| 5 | 5.59e-08 | 1.25e-06 |
| 7 | 2.71e-10 | 6.49e-09 |
| 9 | 0 | 0 |

d = 0.02:

| No. Seasons | Avg. alpha | Max. alpha |
| --- | --- | --- |
| 1 | 0.223 | 1 |
| 3 | 0.0345 | 0.299 |
| 5 | 9.48e-03 | 0.103 |
| *7* | 2.89e-03 | 0.0361 |
| 9 | 9.99e-04 | 0.0146 |

d = 0.01:

| No. Seasons | Avg. alpha | Max. alpha |
| --- | --- | --- |
| 1 | 0.639 | 1 |
| 3 | 0.290 | 1 |
| 5 | 0.173 | 1 |
| 7 | 0.0974 | 0.695 |
| 9 | 0.0595 | 0.486 |

Comparing these tables to that of maximum probability change shows that it is for 7 seasons that \operatorname{max}(\alpha) < 0.05 to within d = d_\text{max} = 0.02 (conservative estimate).
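The selection itself amounts to a simple lookup, sketched here with the maximum significance levels reported above for d = 0.02 and d = 0.05. The function name and the default threshold argument are illustrative.

```python
# Maximum significance levels by number of seasons, copied from the tables above.
max_alpha_d002 = {1: 1.0, 3: 0.299, 5: 0.103, 7: 0.0361, 9: 0.0146}
max_alpha_d005 = {1: 0.0683, 3: 2.08e-04, 5: 1.25e-06, 7: 6.49e-09, 9: 0.0}


def select_seasons(max_alpha_by_seasons, threshold=0.05):
    """Smallest number of seasons with max(alpha) below the threshold,
    or None if no tabulated number of seasons satisfies it."""
    for seasons in sorted(max_alpha_by_seasons):
        if max_alpha_by_seasons[seasons] < threshold:
            return seasons
    return None


print(select_seasons(max_alpha_d002))
print(select_seasons(max_alpha_d005))
```

For d = 0.02 the first number of seasons below the threshold is 7, in agreement with the marked row; for the looser d = 0.05, 3 seasons would already suffice.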

Example: Probability Factors (2014)

Probability factors are reported in the following table for 2014:

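| State No.: (B,o) | K | BB | IBB | HBP | out | 1B | 2B | 3B | HR | SH | SF |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0: (∅,0) | 0.983(2) | 0.923(6) | 0(0) | 0.94(5) | 1.0439(9) | 1.015(3) | 1.063(9) | 1.08(8) | 1.11(1) | 0(0) | 0(0) |
| 1: (1,0) | 0.826(7) | 0.77(2) | 0(0) | 1.0(2) | 1.001(3) | 1.098(9) | 1.02(3) | 0.8(3) | 1.09(5) | 7.6(2) | 0(0) |
| 2: (2,0) | 0.82(1) | 1.13(4) | 0.8(5) | 1.2(3) | 0.962(6) | 1.00(2) | 0.98(6) | 0.8(6) | 0.8(1) | 9.4(1) | 0.0(5) |
| 3: (3,0) | 0.89(4) | 1.2(1) | 1(1) | 1.0(9) | 0.74(2) | 1.14(5) | 1.0(2) | 1(2) | 1.0(3) | 0(1) | 18.48(5) |
| 4: (12,0) | 0.86(2) | 0.81(4) | 0(0) | 1.0(4) | 0.964(7) | 1.01(2) | 0.94(7) | 0.8(7) | 1.0(1) | 11.4(2) | 0.0(5) |
| 5: (13,0) | 0.81(3) | 0.74(8) | 1(1) | 1.0(6) | 0.73(1) | 1.16(4) | 1.1(1) | 1(1) | 1.2(2) | 3.2(7) | 21.0(4) |
| 6: (23,0) | 0.88(3) | 1.11(9) | 7.2(8) | 1.1(8) | 0.72(1) | 1.05(4) | 0.9(2) | 1(1) | 0.9(3) | 0(1) | 18.7(1) |
| 7: (123,0) | 0.91(3) | 0.72(9) | 0(0) | 1.2(7) | 0.78(1) | 1.13(4) | 1.1(1) | 1(1) | 0.9(2) | 0(0) | 21.4(3) |
| 8: (∅,1) | 1.062(3) | 0.966(8) | 0(0) | 0.91(7) | 1.026(1) | 0.991(4) | 0.99(1) | 1.0(1) | 0.98(2) | 0(0) | 0(0) |
| 9: (1,1) | 0.909(6) | 0.86(2) | 0(0) | 1.1(1) | 1.023(3) | 1.109(8) | 1.05(3) | 1.0(3) | 1.04(4) | 2.46(9) | 0(0) |
| 10: (2,1) | 0.99(1) | 1.36(3) | 4.0(1) | 1.2(2) | 0.967(4) | 0.93(1) | 0.91(5) | 1.0(4) | 0.91(8) | 0.2(3) | 0.0(4) |
| 11: (3,1) | 0.89(2) | 1.31(5) | 5.5(3) | 1.5(4) | 0.663(8) | 1.10(2) | 0.99(8) | 1.3(8) | 0.9(1) | 0.9(5) | 18.7(5) |
| 12: (12,1) | 0.94(1) | 0.97(3) | 0.0(5) | 1.0(3) | 1.031(5) | 1.03(1) | 1.00(5) | 0.9(5) | 1.04(8) | 2.1(3) | 0.0(4) |
| 13: (13,1) | 0.82(2) | 0.81(5) | 1.1(7) | 1.3(4) | 0.754(8) | 1.12(2) | 1.07(8) | 1.0(8) | 1.0(1) | 2.8(4) | 20.2(7) |
| 14: (23,1) | 0.85(2) | 1.11(6) | 20.6(7) | 1.6(4) | 0.635(9) | 0.97(3) | 0.90(9) | 1.2(9) | 0.6(2) | 0.4(6) | 15.9(3) |
| 15: (123,1) | 0.89(2) | 0.78(6) | 0(0) | 1.1(5) | 0.782(9) | 1.14(3) | 1.03(9) | 1.4(8) | 1.0(1) | 0.1(6) | 20.3(6) |
| 16: (∅,2) | 1.105(3) | 1.078(9) | 0.0(2) | 0.92(8) | 1.002(1) | 0.959(5) | 0.97(2) | 0.9(2) | 0.98(3) | 0(0) | 0(0) |
| 17: (1,2) | 1.003(6) | 0.95(2) | 0.0(3) | 1.0(1) | 1.035(3) | 1.028(8) | 1.02(3) | 1.0(3) | 1.05(5) | 0(0) | 0(0) |
| 18: (2,2) | 1.045(8) | 1.40(2) | 8.9(3) | 1.1(2) | 0.924(4) | 0.85(1) | 0.88(4) | 1.1(4) | 0.77(7) | 0(0) | 0(0) |
| 19: (3,2) | 1.04(1) | 1.48(4) | 6.73(9) | 1.2(3) | 0.937(6) | 0.84(2) | 0.94(7) | 1.0(6) | 0.7(1) | 0(0) | 0(0) |
| 20: (12,2) | 1.09(1) | 1.07(3) | 0.2(4) | 1.2(2) | 1.022(4) | 0.92(1) | 0.90(5) | 1.1(4) | 0.93(8) | 0(0) | 0(0) |
| 21: (13,2) | 1.01(2) | 1.09(4) | 0.8(6) | 1.2(4) | 1.024(7) | 0.98(2) | 0.98(7) | 1.1(7) | 0.9(1) | 0(0) | 0(0) |
| 22: (23,2) | 1.07(2) | 1.42(5) | 14.1(3) | 1.0(4) | 0.901(8) | 0.75(3) | 0.73(9) | 0.9(8) | 0.7(1) | 0(0) | 0(0) |
| 23: (123,2) | 1.13(2) | 0.94(5) | 0(0) | 1.1(4) | 1.026(8) | 0.91(2) | 0.99(8) | 1.6(7) | 0.9(1) | 0(0) | 0(0) |
| 24: (*,3;0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) |
| 25: (*,3;1) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) |
| 26: (*,3;2) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) |
| 27: (*,3;3) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) |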

Note that this is the most recent season for which centered data [Eq. (6)] can be used, for calculations over 7 seasons (see above).

Event Analysis

From the above table, several observations can be made.

The following are for particular events:

Intentional base on balls. These occur most frequently when first base is not occupied; the largest increases occur when second and third are occupied, with one or two outs.

Sacrifice hit. This event occurs in order to allow one or more baserunners to advance. It therefore occurs with at least one baserunner and fewer than two outs. There are only specific situations, however, for which this probability is significant (meaning that the probability factor is not consistent with 0, considering uncertainty), as follows:

  • (12,0)
  • (2,0)
  • (1,0)
  • (1,1)
  • (12,1);

and squeeze plays:

  • (13,0)
  • (13,1)
  • (3,1).

Insignificant states are (2,1) for the former, and (3,0) and states with a baserunner on second base for the latter.

Sacrifice fly. This event occurs when there is a baserunner on third and/or second base, with fewer than two outs. It occurs significantly (as defined above), though, only when a baserunner is on third. While it does occur with a baserunner on second and not third base, the probability for this is insignificant.

Hit by pitch. This event seems to be unaffected by baseball state.

Other events will be considered below.

Probability Adjustments

Changes in event probabilities (for a particular state) must be offset by opposite changes in others (for that state).

These generally occur for strikeouts, base on balls, and outs.

The probabilities of hits are relatively unaffected by baseball state; only for state (23,2) are the probabilities of all hits noticeably decreased, and for state (23,1) is the probability of a home run significantly decreased.

(Verification) Assumption of Uniformity

Calculations of probability factors over recent seasons can be used to verify the assumption of uniformity (see above).

Example: Probability Factors (2007 and 2000)

The most recent seasons for which centered data (and over 7 seasons) can be calculated with no overlap (with 2014, or each other) are 2007 and 2000.

Probability factors for these seasons are reported in the following tables:

2007:

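| State No.: (B,o) | K | BB | IBB | HBP | out | 1B | 2B | 3B | HR | SH | SF |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0: (∅,0) | 0.976(3) | 0.907(6) | 0.0(1) | 0.90(5) | 1.0521(8) | 1.024(3) | 1.063(8) | 1.12(8) | 1.10(1) | 0(0) | 0(0) |
| 1: (1,0) | 0.816(8) | 0.77(2) | 0(0) | 1.0(1) | 0.982(3) | 1.096(8) | 1.04(3) | 0.8(3) | 1.02(5) | 7.6(2) | 0(0) |
| 2: (2,0) | 0.81(2) | 1.03(3) | 0.6(4) | 1.3(3) | 0.957(5) | 0.98(2) | 0.90(6) | 1.0(5) | 0.8(1) | 9.4(1) | 0(0) |
| 3: (3,0) | 0.91(4) | 1.08(9) | 1(1) | 1.6(8) | 0.75(2) | 1.11(5) | 1.0(2) | 1(1) | 0.8(3) | 0.1(9) | 17.3(1) |
| 4: (12,0) | 0.87(2) | 0.80(4) | 0.0(5) | 0.9(3) | 0.947(6) | 0.99(2) | 0.87(6) | 0.7(6) | 1.0(1) | 11.2(1) | 0.0(5) |
| 5: (13,0) | 0.78(3) | 0.89(7) | 0.7(8) | 1.4(6) | 0.75(1) | 1.16(3) | 1.1(1) | 1(1) | 1.0(2) | 1.9(6) | 19.1(3) |
| 6: (23,0) | 0.87(4) | 0.97(8) | 6.3(6) | 1.4(7) | 0.74(1) | 1.07(4) | 1.0(1) | 1(1) | 0.8(2) | 0.0(8) | 16.28(2) |
| 7: (123,0) | 0.91(4) | 0.69(8) | 0(0) | 1.3(6) | 0.78(1) | 1.13(4) | 1.2(1) | 1(1) | 1.1(2) | 0(0) | 18.6(2) |
| 8: (∅,1) | 1.053(3) | 0.965(8) | 0.0(1) | 0.90(7) | 1.039(1) | 0.990(4) | 1.00(1) | 1.0(1) | 1.00(2) | 0(0) | 0(0) |
| 9: (1,1) | 0.914(7) | 0.88(2) | 0.0(2) | 1.1(1) | 1.012(3) | 1.110(8) | 1.04(3) | 0.8(3) | 1.10(4) | 2.36(8) | 0(0) |
| 10: (2,1) | 0.98(1) | 1.32(2) | 4.54(7) | 1.1(2) | 0.971(4) | 0.93(1) | 0.93(4) | 1.0(4) | 0.88(7) | 0.1(3) | 0.0(3) |
| 11: (3,1) | 0.88(2) | 1.28(4) | 5.1(3) | 1.7(4) | 0.675(8) | 1.08(2) | 0.98(8) | 1.3(7) | 0.8(1) | 0.9(4) | 16.9(4) |
| 12: (12,1) | 0.99(1) | 0.95(3) | 0(0) | 1.2(2) | 1.026(4) | 1.00(1) | 1.00(5) | 0.8(5) | 1.01(8) | 1.7(2) | 0.0(4) |
| 13: (13,1) | 0.84(2) | 0.85(5) | 1.1(5) | 1.2(4) | 0.742(7) | 1.12(2) | 1.03(7) | 0.8(7) | 0.9(1) | 2.1(4) | 19.3(6) |
| 14: (23,1) | 0.84(2) | 1.07(5) | 21.7(7) | 1.4(4) | 0.617(9) | 0.90(3) | 0.84(8) | 1.1(8) | 0.6(2) | 0.2(5) | 14.8(2) |
| 15: (123,1) | 0.96(2) | 0.79(5) | 0(0) | 1.4(4) | 0.776(8) | 1.06(2) | 1.06(8) | 0.9(8) | 0.9(1) | 0.1(5) | 19.1(5) |
| 16: (∅,2) | 1.111(4) | 1.077(9) | 0.0(2) | 0.90(8) | 1.011(1) | 0.969(5) | 0.97(2) | 0.9(2) | 0.99(3) | 0(0) | 0(0) |
| 17: (1,2) | 1.027(7) | 0.98(2) | 0.0(2) | 0.9(1) | 1.032(3) | 1.026(8) | 1.00(3) | 0.9(3) | 1.05(5) | 0(0) | 0(0) |
| 18: (2,2) | 1.05(1) | 1.44(2) | 8.1(2) | 1.1(2) | 0.910(4) | 0.85(1) | 0.87(4) | 1.1(3) | 0.83(6) | 0(0) | 0(0) |
| 19: (3,2) | 1.05(2) | 1.53(3) | 5.9(1) | 1.2(3) | 0.936(6) | 0.83(2) | 0.86(6) | 0.9(6) | 0.8(1) | 0(0) | 0(0) |
| 20: (12,2) | 1.12(1) | 1.09(2) | 0.1(3) | 1.2(2) | 1.018(4) | 0.92(1) | 0.95(4) | 1.3(4) | 0.93(7) | 0(0) | 0(0) |
| 21: (13,2) | 1.02(2) | 1.16(4) | 0.6(5) | 1.4(3) | 1.014(6) | 0.98(2) | 0.99(6) | 1.1(6) | 0.8(1) | 0(0) | 0(0) |
| 22: (23,2) | 1.08(2) | 1.44(4) | 13.3(2) | 1.2(4) | 0.857(8) | 0.76(2) | 0.81(8) | 1.1(7) | 0.8(1) | 0(0) | 0(0) |
| 23: (123,2) | 1.12(2) | 0.99(4) | 0.0(5) | 1.1(4) | 1.023(7) | 0.93(2) | 1.00(7) | 1.3(7) | 1.0(1) | 0(0) | 0(0) |
| 24: (*,3;0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) |
| 25: (*,3;1) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) |
| 26: (*,3;2) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) |
| 27: (*,3;3) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) |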

2000:

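| State No.: (B,o) | K | BB | IBB | HBP | out | 1B | 2B | 3B | HR | SH | SF |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0: (∅,0) | 0.969(3) | 0.912(6) | 0(0) | 0.91(5) | 1.0534(8) | 1.027(3) | 1.061(9) | 1.09(8) | 1.10(1) | 0(0) | 0(0) |
| 1: (1,0) | 0.806(8) | 0.76(2) | 0.0(3) | 1.0(2) | 0.984(3) | 1.095(8) | 0.98(3) | 0.8(3) | 0.99(5) | 8.0(2) | 0(0) |
| 2: (2,0) | 0.84(2) | 1.06(3) | 0.7(4) | 1.2(3) | 0.956(5) | 0.98(2) | 0.86(6) | 0.9(5) | 0.7(1) | 8.84(7) | 0.0(4) |
| 3: (3,0) | 0.93(4) | 1.15(9) | 1(1) | 1.3(8) | 0.73(2) | 1.11(5) | 1.0(2) | 1(1) | 0.7(3) | 0.1(9) | 17.64(6) |
| 4: (12,0) | 0.82(2) | 0.78(4) | 0(0) | 1.2(3) | 0.942(6) | 1.01(2) | 0.92(7) | 0.9(6) | 1.0(1) | 11.4(1) | 0.0(4) |
| 5: (13,0) | 0.79(3) | 0.75(6) | 0.9(8) | 1.3(6) | 0.74(1) | 1.22(3) | 1.1(1) | 1(1) | 1.0(2) | 2.0(5) | 18.2(3) |
| 6: (23,0) | 0.95(4) | 1.02(8) | 4.9(7) | 2.0(7) | 0.71(1) | 1.04(4) | 1.0(1) | 1(1) | 0.8(2) | 0.2(8) | 16.5922(4) |
| 7: (123,0) | 0.91(4) | 0.68(8) | 0(0) | 1.3(7) | 0.76(1) | 1.16(4) | 1.1(1) | 1(1) | 0.9(2) | 0.0(8) | 19.5(2) |
| 8: (∅,1) | 1.066(3) | 0.972(7) | 0.0(1) | 0.91(7) | 1.035(1) | 0.991(4) | 1.00(1) | 1.0(1) | 0.99(2) | 0(0) | 0(0) |
| 9: (1,1) | 0.882(8) | 0.87(2) | 0.0(2) | 1.0(1) | 1.027(3) | 1.127(7) | 1.05(3) | 0.9(3) | 1.02(4) | 2.17(9) | 0(0) |
| 10: (2,1) | 1.02(1) | 1.39(2) | 4.43(8) | 1.2(2) | 0.952(4) | 0.91(1) | 0.93(4) | 1.0(4) | 0.83(7) | 0.1(3) | 0.0(3) |
| 11: (3,1) | 0.91(2) | 1.25(4) | 5.0(3) | 1.6(4) | 0.656(8) | 1.07(2) | 0.96(8) | 1.2(7) | 0.8(1) | 0.9(4) | 17.3(4) |
| 12: (12,1) | 0.97(1) | 0.92(3) | 0.0(4) | 1.1(2) | 1.033(4) | 1.02(1) | 1.03(5) | 1.0(4) | 0.99(8) | 1.6(2) | 0.0(3) |
| 13: (13,1) | 0.82(2) | 0.84(4) | 0.9(5) | 1.3(4) | 0.744(7) | 1.14(2) | 1.07(7) | 1.3(7) | 1.0(1) | 1.5(4) | 18.8(5) |
| 14: (23,1) | 0.86(2) | 0.96(5) | 22.0(7) | 1.5(4) | 0.631(8) | 0.85(3) | 0.91(9) | 1.0(8) | 0.6(1) | 0.3(5) | 14.6(2) |
| 15: (123,1) | 0.94(2) | 0.74(5) | 0(0) | 1.4(4) | 0.772(8) | 1.06(2) | 1.10(8) | 1.3(7) | 1.1(1) | 0.2(5) | 18.6(5) |
| 16: (∅,2) | 1.112(4) | 1.092(8) | 0.0(2) | 0.90(8) | 1.008(1) | 0.962(5) | 0.97(2) | 0.9(2) | 1.04(2) | 0(0) | 0(0) |
| 17: (1,2) | 1.010(7) | 0.97(2) | 0.0(2) | 0.9(1) | 1.037(3) | 1.029(8) | 0.99(3) | 1.0(3) | 1.08(4) | 0(0) | 0(0) |
| 18: (2,2) | 1.05(1) | 1.45(2) | 8.1(2) | 1.1(2) | 0.916(4) | 0.84(1) | 0.89(4) | 1.0(3) | 0.77(6) | 0(0) | 0(0) |
| 19: (3,2) | 1.08(2) | 1.50(3) | 5.8(1) | 1.3(3) | 0.929(6) | 0.83(2) | 0.87(6) | 1.0(6) | 0.8(1) | 0(0) | 0(0) |
| 20: (12,2) | 1.12(1) | 1.08(2) | 0.1(3) | 1.2(2) | 1.029(4) | 0.90(1) | 0.94(4) | 1.1(4) | 0.90(7) | 0(0) | 0(0) |
| 21: (13,2) | 1.03(2) | 1.09(4) | 0.5(5) | 1.3(3) | 1.012(6) | 0.99(2) | 1.03(6) | 1.0(6) | 1.0(1) | 0(0) | 0(0) |
| 22: (23,2) | 1.10(2) | 1.39(4) | 12.9(2) | 1.2(4) | 0.875(8) | 0.76(2) | 0.83(8) | 0.8(8) | 0.7(1) | 0(0) | 0(0) |
| 23: (123,2) | 1.18(2) | 0.92(4) | 0.0(6) | 1.2(4) | 1.024(7) | 0.89(2) | 1.03(7) | 1.6(6) | 1.0(1) | 0(0) | 0(0) |
| 24: (*,3;0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) |
| 25: (*,3;1) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) |
| 26: (*,3;2) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) |
| 27: (*,3;3) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) | 1(0) |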

Comparing all tables shows that the probability factors appear to be both qualitatively and quantitatively (see the following check) similar.

An interesting state to quantitatively consider is (23,1). The probabilities of both intentional base on balls and sacrifice fly are significantly increased; the probability of an out is (most) significantly decreased; that of hit by pitch is not necessarily consistent with 1, considering uncertainty; it is a state for which the probability of a home run is significantly affected (see the above discussion); and deviations for other events occur.

Probability factors are repeated (from the above tables) for this state, for convenient comparison, in the following table:

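| Season | K | BB | IBB | HBP | out | 1B | 2B | 3B | HR | SH | SF |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2014 | 0.85(2) | 1.11(6) | 20.6(7) | 1.6(4) | 0.635(9) | 0.97(3) | 0.90(9) | 1.2(9) | 0.6(2) | 0.4(6) | 15.9(3) |
| 2007 | 0.84(2) | 1.07(5) | 21.7(7) | 1.4(4) | 0.617(9) | 0.90(3) | 0.84(8) | 1.1(8) | 0.6(2) | 0.2(5) | 14.8(2) |
| 2000 | 0.86(2) | 0.96(5) | 22.0(7) | 1.5(4) | 0.631(8) | 0.85(3) | 0.91(9) | 1.0(8) | 0.6(1) | 0.3(5) | 14.6(2) |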

The probability factors are indeed both qualitatively and quantitatively (very) similar; in most cases, within uncertainties for all three seasons. The only exceptions are \text{BB}, \text{1B}, and \text{SF}, all increasing in more recent seasons. Even so (when considering uncertainties), the change is slow (uncertainties overlap, or nearly so, between adjacent seasons).
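This comparison can be reproduced with a short sketch. It assumes that the parenthesized digit(s) give the uncertainty in the last reported digit(s) [e.g., 0.635(9) means 0.635 ± 0.009], and that "within uncertainties" means that the two intervals overlap; both readings, and the function names, are assumptions. The values are the 2000 and 2014 rows for state (23,1).

```python
import re
from decimal import Decimal

EVENTS = ["K", "BB", "IBB", "HBP", "out", "1B", "2B", "3B", "HR", "SH", "SF"]

# Probability factors for state (23,1), copied from the tables above.
ROWS = {
    2014: "0.85(2) 1.11(6) 20.6(7) 1.6(4) 0.635(9) 0.97(3) 0.90(9) 1.2(9) 0.6(2) 0.4(6) 15.9(3)",
    2000: "0.86(2) 0.96(5) 22.0(7) 1.5(4) 0.631(8) 0.85(3) 0.91(9) 1.0(8) 0.6(1) 0.3(5) 14.6(2)",
}


def parse_concise(text):
    """Parse 'value(uncertainty)' into (value, sigma) as exact decimals."""
    match = re.fullmatch(r"(\d+(?:\.(\d+))?)\((\d+)\)", text)
    if match is None:
        raise ValueError(f"not in value(uncertainty) form: {text!r}")
    value = Decimal(match.group(1))
    n_decimals = len(match.group(2) or "")
    # The digits in parentheses apply to the last reported decimal place(s).
    sigma = Decimal(match.group(3)) * Decimal(10) ** -n_decimals
    return value, sigma


def overlaps(a, b):
    """True if the intervals value +/- sigma of a and b overlap."""
    (value_a, sigma_a), (value_b, sigma_b) = a, b
    return abs(value_a - value_b) <= sigma_a + sigma_b


parsed = {season: [parse_concise(t) for t in row.split()]
          for season, row in ROWS.items()}
exceptions = [name
              for name, old, new in zip(EVENTS, parsed[2000], parsed[2014])
              if not overlaps(old, new)]
print(exceptions)
```

Exact decimal arithmetic is used so that borderline cases (such as \text{IBB}, where the difference equals the sum of the uncertainties) are not decided by floating-point rounding. Between 2000 and 2014, the events flagged by this check are \text{BB}, \text{1B}, and \text{SF}.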

Considering these results (including the exceptions) justifies the above assumption.

Discussion

A method to accurately determine state-based event probabilities and their uncertainties was developed.

This was based on consideration of probability factors instead of probabilities. Whereas probabilities change over time, these do so only slowly [the assumption of uniformity (see above), which was verified (see above)]. This subtle change in consideration resolves the data amount/change dilemma.

A statistical analysis of the probability-factor matrix was performed.

Consideration was given to the important balance between accuracy and (possible) slow changes in probability factors. A criterion was developed to optimally achieve this.

In order to calculate the matrix elements according to this criterion, data over 7 seasons was needed.

While (some of) the results are qualitatively expected, quantitative probability factors can be determined by this method. This is important for determining precisely how the probabilities are affected in any baseball state; and these can be significant.

Conclusions

A method to accurately determine state-based event probabilities was developed.

Following a statistical analysis, the amount of data used ensures that, for any baseball state, the probability that all event probabilities are simultaneously within d = 0.02 (i.e., 2 percentage points) of their true values is > 95%; on average over states, > 99%.

State-based event probabilities were determined, and discussed for 2014.

These results provide quantitative information.

They also provide fundamental insight into the mechanics of baseball.

They will find several applications; for example, in the event-based framework for the Markov chain model of baseball. Such applications will be discussed in a future article(s).

References

[1] S. K. Thompson, “Sample Size for Estimating Multinomial Proportions,” The American Statistician 41, 42–46 (1987)

[2] “Strategies for Variance Estimation,” 37. Accessed: 2018-07-21


About Author

statshacker

statshacker is an Assistant Professor of Physics and Astronomy at a well-known state university. His research interests involve the development and application of concepts and techniques from the emerging field of data science to study large data sets. Outside of academic research, he is particularly interested in such data sets that arise in sports and finance. Contact: statshacker@statshacker.com
