Prediction of Strikeout Rate (K%) by Machine Learning


Batter–pitcher matchups are a (if not the) significant component of the game of baseball. Understanding these provides important insight into the game. This finds several applications.

In this work, the underlying probability distribution describing batter–pitcher matchups is determined (numerically, and in principle) exactly. This is done using machine learning.

As an application, strikeout rate (K%) is considered.

The results show a steady and significant decrease in both predictive accuracy and discrimination, season-over-season.

They are also used to assess the analytical log5 method. Within uncertainties, this is found to be equally accurate.

Considering individual seasons in more detail though shows that the probability distribution described by the log5 formula is not the same.

There appears to be a subtle bias in the log5 results.

These results are significant for baseball research, providing extremely high-quality numerical results.

To cite this Article:

statshacker, “Prediction of Strikeout Rate (K%) by Machine Learning,” statshacker [] Accessed: YYYY-MM-DD


The game of baseball has a discrete, well-defined, and relatively “clean” structure.

A significant portion of this structure consists of the events that result from specific batter–pitcher matchups. Because of the stochastic nature of baseball, it is more specifically the probabilities of these events that are of theoretical concern.

Accurate determination of these probabilities provides significant insight into the game of baseball. This, in turn, finds several applications. These include, for example, simulated games and evaluating players; and associated applications, such as strategy decisions and lineup determination, respectively.

The data-heavy nature of baseball makes it particularly well suited to machine-learning methods. These, in principle, are capable of learning the exact underlying probability distribution describing these matchups.

In this Article, strikeout rate (K%) (probability) in batter–pitcher matchups is determined (numerically) exactly. This is done using machine learning.

It is important to realize the purpose of this Article: it is a proof of concept. The purpose is not to determine how accurately (overall) this determination can be made (indeed, assumptions are later discussed). Rather, it is to provide answers to several questions, such as the following: Can machine learning be used to predict the underlying probability distribution describing batter–pitcher matchups? (Note that this is also an interesting question from the perspective of machine learning; see below.) Given only direct data, how accurate a prediction can be made? How accurate are analytical estimations?

Pairwise Comparisons

A comprehensive consideration of pairwise comparisons in baseball will be discussed in a future article.

The precise details of such methods are not important for the results presented and discussed herein anyway.

Important, however, is specification of a gold standard test, which can be used as a benchmark. Note that this term is therefore used as a definition, as the best performing test available.

log5 Method

For baseball, a gold-standard test is the log5 method.

This method is theoretically equivalent to the Bradley–Terry model of pairwise comparisons [1].

As applied to baseball, the probability that one team beats another was independently derived [2] by Bill James [3].

Later [4], James, in collaboration with Dallas Adams, extended this method to specific pitcher–batter matchups. The important difference is that the league average, in this case, is not necessarily 0.500.

The log5 method may be formulated as follows. For a specific batter–pitcher matchup, the probability P(E) of an event E (treated as a Bernoulli trial) is calculated as

(1)   \begin{equation*} P(E) = \frac{\frac{P_\text{B} P_\text{P}}{P_\text{L}}}{\frac{P_\text{B} P_\text{P}}{P_\text{L}} + \frac{(1 - P_\text{B})(1 - P_\text{P})}{(1 - P_\text{L})}} \end{equation*}

where P_\text{B}, P_\text{P}, and P_\text{L} are the probabilities of success for the event for the batter \text{B}, pitcher \text{P}, and league \text{L}, respectively.

As remarked above, the event of interest herein is a strikeout.
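As a minimal sketch of Eq. (1), the formula can be written as a small function. The function name `log5` and the example rates are assumptions chosen for illustration; they are not taken from the data used in this Article.

```python
def log5(p_batter: float, p_pitcher: float, p_league: float) -> float:
    """Eq. (1): probability of the event for a given batter-pitcher matchup."""
    # Odds-like weight for the event occurring.
    success = p_batter * p_pitcher / p_league
    # Corresponding weight for the event not occurring.
    failure = (1.0 - p_batter) * (1.0 - p_pitcher) / (1.0 - p_league)
    return success / (success + failure)


# Hypothetical rates: a 25% K batter facing a 30% K pitcher in a 20% K league.
print(round(log5(0.25, 0.30, 0.20), 4))  # 0.3636
```

A useful sanity check is that a league-average pitcher leaves the batter's rate unchanged: `log5(p, p_league, p_league)` returns `p`.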

An early empirical test [5] of this method as applied to batter–pitcher matchups (specifically, batting average) showed that this formula provides an accurate model. A more recent consideration [6] (specifically, for the probability of a strikeout) also came to this conclusion.

Machine-Learning Methods

Machine learning is a subset of artificial intelligence in the field of computer science that often uses statistical techniques to give computers the ability to “learn” with data, without being explicitly programmed [7].

Machine learning is a broad field, with several applications.

Consider an unknown target function f: \textbf{X} \rightarrow \textbf{Y} which maps input \textbf{X} to output \textbf{Y}. Consider that instead of access to f, one has access to a number N of example mappings (\textbf{x}_1,\textbf{y}_1), \ldots, (\textbf{x}_N,\textbf{y}_N) (generated by f). Machine learning, in this context, gives a computer the ability to “learn” from this data, and select from a hypothesis set (infinite, in this case; see below) a model g \approx f.

Particularly powerful is that certain methods (as employed herein) satisfy the universal approximation theorem [8] — that is, they can (i.e., in principle) approximate any continuous function. Important to note (and as will be discussed below) though is that this theorem does not discuss the algorithmic learnability (i.e., in practice) of them. Note also that the methods consider the problem herein as one of classification, though the underlying mapping of interest [i.e., which Eq. (1) is an approximation to] is continuous.

As applied to baseball research, the use of machine learning is only an emerging field. Indeed, a recent systematic literature review [9] shows only very few (though increasing) applications.

Machine learning has been applied though in tangential contexts to that considered herein. For example, it has been used in academic settings [10]; and more recently [11] to predict the outcome of an at-bat.

Herein, interest is on more fundamental questions (outlined above).

This problem is also an interesting one from the perspective of machine learning. This is because the stochastic nature of the outcome of events leads to significant class label noise. Consider the league strikeout rate (e.g., as calculated below, for 2017, this is approximately 21.8%); and compare this to the extremes — 50% would be completely random, and 0% would be trivial.


This section describes application of the data mining process (detailed here) to this problem.

Data Understanding

Play-by-play data was obtained from Retrosheet.

Data Preparation

Data was processed and stored in a relational database using the relational database management system PostgreSQL.

The following data preparation used DB++ as an interface to PostgreSQL, and bbDBi as an interface to the baseball database.

Batter–pitcher matchup data was obtained for the 1990–2017 seasons. Note that no filtering by league nor game type (regular vs. postseason, or any other filtering) was considered.

This was done by extracting all events from the corresponding table where BAT_EVENT_FL = ‘t’, and where EVENT_CD was one of the following:

  • generic out
  • strikeout
  • walk
  • hit by pitch
  • error
  • fielder’s choice
  • single
  • double
  • triple
  • home run;

not included were the following events:

  • intentional walk
  • (catcher’s) interference;

and the following event was not found:

  • foul error.

Note that events excluded were those based not solely upon the ability of the batter or pitcher.

From this information, a dataset of IDs was created as

    \begin{equation*} \text{[batter ID]} \text{[pitcher ID]} \text{[game ID]} ~~~ ; \end{equation*}

and since only strikeout information is of interest, the above events were condensed into a binary value \lbrace 0,1 \rbrace,

(2)   \begin{equation*} \text{[K]} ~~~ , \end{equation*}

(i.e., whether or not a strikeout occurred).

Input needed for Eq. (1) or machine learning are not IDs, but the following information:

  • \text{[batter K\%]}
  • \text{[pitcher K\%]}

and in some cases [e.g., Eq. (1)]

  • \text{[league K\%]}

where \text{[batter K\%]}, \text{[pitcher K\%]}, and \text{[league K\%]} are the strikeout rates for each batter, pitcher, and the league, respectively.

These statistics were calculated directly from reported statistics, under the assumption of uniformity.

Seasonal (yearly) data was used for calculating strikeout rates.

Note that this choice is based on several considerations. One is that, for the application considered herein, statistics for the batters, pitchers, and league must be calculated over the same time frame. (Interest is in having the computer “learn” directly from the data — i.e., information about the batters, pitchers, and league from specific batter–pitcher matchups and their outcomes; all considered over the same time frame, without additional adjustments.) Another is that sufficient time is needed to “reliably” estimate these statistics (see below). On the other hand, short enough time is needed so that the statistics do not change significantly (e.g., strikeout rates have been increasing steadily in recent years).

Note that because interest is not on forward prediction, but rather on how the underlying statistics are related to their outcome, using “present” data is correct.

Statistics were calculated as follows:

(3)   \begin{eqnarray*} \text{batter K\%} =& \frac{\text{K}}{\text{PA}'} \\ \text{pitcher K\%} =& \frac{\text{K}}{\text{BF}'} \end{eqnarray*}

where \text{PA}' and \text{BF}' are a modified plate appearance (\text{PA}) and batters faced (\text{BF}),

    \begin{eqnarray*} \text{PA}' =& \text{PA} - \text{IBB} - \text{CINT} \\ \text{BF}' =& \text{BF} - \text{IBB} - \text{CINT} \end{eqnarray*}

where \text{IBB} and \text{CINT} are intentional walks and catcher’s interference, respectively. These modifications are necessary for consistency with the extraction of batter–pitcher matchup data (discussed above).
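The modified rates of Eq. (3) can be sketched as follows. The function name and the season line in the example are hypothetical; only the formula (K divided by PA or BF, less IBB and CINT) comes from the text.

```python
def modified_k_rate(strikeouts: int, appearances: int, ibb: int, cint: int) -> float:
    """Eq. (3): K / PA' (or K / BF'), where the primed total drops IBB and CINT."""
    adjusted = appearances - ibb - cint
    if adjusted <= 0:
        raise ValueError("no qualifying plate appearances")
    return strikeouts / adjusted


# Hypothetical season line: 120 K in 600 PA, with 8 IBB and 2 CINT.
print(round(modified_k_rate(120, 600, 8, 2), 4))  # 0.2034
```

The same function applies to pitchers by passing BF in place of PA.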

The league calculation was made analogous to Eq. (3) (either equation).

In order to “reliably” calculate the quantities in Eq. (3), one must consider sample size; in particular: how many \text{PA} (or \text{BF}) are necessary to reliably estimate K%. Defining “reliable” to be the point at which the signal-to-noise crosses the halfway point, it has been shown [12] that 60 \text{PA} are needed (for this statistic).

Events involving batters or pitchers with fewer than 60 \text{PA} or \text{BF} (unprimed), respectively, were therefore removed from the dataset (of IDs).

Note that these events were not removed from the calculation of statistics for other batters or pitchers though (e.g., a batter with 502 \text{PA} may include data for when facing a pitcher with only 59 \text{BF}) or the league. This is correct, as the concern is not with the events themselves, but rather only the sample size necessary to evaluate Eq. (3).
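The filtering step described above can be sketched as follows. This is an illustration under stated assumptions: the tuple layout, the dictionary names, and the IDs are hypothetical, and the season totals are assumed to have been computed from all events before any filtering (as the text requires).

```python
MIN_SAMPLE = 60  # PA (batters) or BF (pitchers) threshold quoted in the text


def filter_events(events, batter_pa, pitcher_bf, min_sample=MIN_SAMPLE):
    """Drop events whose batter or pitcher falls below the threshold.

    `events` holds (batter_id, pitcher_id, game_id, k) tuples; `batter_pa` and
    `pitcher_bf` map IDs to unprimed season totals computed from ALL events.
    """
    return [
        event for event in events
        if batter_pa[event[0]] >= min_sample and pitcher_bf[event[1]] >= min_sample
    ]


# Hypothetical IDs and totals, for illustration only.
events = [("b1", "p1", "g1", 1), ("b1", "p2", "g2", 0), ("b2", "p1", "g3", 0)]
batter_pa = {"b1": 502, "b2": 41}
pitcher_bf = {"p1": 310, "p2": 59}
kept = filter_events(events, batter_pa, pitcher_bf)
print(kept)  # only the ("b1", "p1", "g1", 1) event survives
```

Note that batter b1's total of 502 PA still counts the appearance against pitcher p2, even though that event is dropped from the matchup dataset.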

No other filtering (removal of pitchers from batting, etc.) was performed.

Consider this in the context of the data. Batter and pitcher K% are calculated relative to the league average. The outcomes of all batter–pitcher matchups remaining in the data have a slightly different K% (because of filtering) (e.g., for 2017, this is 0.2176 compared to 0.2135, respectively). Note that contextual adjustments to the log5 method are not needed, in this case. The outcomes are not affected by filtering, and there is thus no additional context. Why this also does not affect the machine-learning results is discussed in the context of data organization for individual seasons.

Even after filtering, there remains a massive amount of data (4,712,853 events).

For calculations considering all 28 seasons, a subset of 28,000 points were randomly selected from the total dataset (i.e., approximately 10,000 per season).

For calculations considering a single season, all data points (for that season) were used.

Data for machine learning was pre-processed by standardizing the input data. Normalizing the inputs in this way usually leads to faster convergence [13].
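A minimal sketch of input standardization follows. It assumes the usual z-score definition (zero mean, unit variance per input feature); the text does not spell out the exact transformation, and the example values are hypothetical.

```python
import statistics


def standardize(values):
    """Rescale one input feature to zero mean and unit (population) variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Hypothetical batter K% values.
batter_k = [0.15, 0.20, 0.25, 0.30]
print([round(z, 3) for z in standardize(batter_k)])
```

In practice the mean and standard deviation would be estimated on the training data and reused unchanged for validation and test inputs.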

Output data was not (needed to be) normalized, since it is already stored in a format for classification (see above).


Machine learning is used herein to model the strikeout rate.

Conventional machine learning (as opposed to deep learning [14]) is expected to work well for this problem. This is because the input data is relatively simple. The raw input data (events) is already preprocessed using good feature extractors (batting and pitching statistics), with the underlying explanatory factors (features) separated. There is therefore not an invariance problem (i.e., irrelevant variations in the input data). And while there is a large data set, with this preprocessing and low dimensionality, it is unlikely that multiple levels of abstraction are needed to describe the unknown target function (see above).

The underlying model is therefore based on a standard multilayer feedforward neural network [15]. This method is well suited for the large data set (as opposed, for example, to an algorithm that considers similarity between examples expressed by a kernel).

Parameters of this model must be “calibrated” to optimal values. This is done by cross validation. In this section, only these values are reported. Results for these optimal values are reported below.

Testing one and two hidden-layer architectures suggests that the results are relatively insensitive to this choice (probably due to the simplicity of the problem, and the amount of data). Taking the results literally (i.e., not considering random fluctuations) suggests that the optimal architecture is

    \begin{equation*} n_x{\textendash}5{\textendash}5{\textendash}1 ~~~ , \end{equation*}

n_x input units (\text{[batter K\%]}, \text{[pitcher K\%]}, and possibly \text{[league K\%]}), two hidden layers each with 5 units, and 1 output unit (\text{[K]}); as shown in the following figure:

The hidden units each use a softplus activation function [16]. These activation functions were compared to the standard logistic function and modified tanh [13], and found to give the best results.

The output unit was the standard logistic function.
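The forward pass of such a network can be sketched as follows, here with two inputs. The weights are random placeholders (not trained values), and the helper names and initialization scheme are assumptions for illustration; training with QRprop is not shown.

```python
import math
import random


def softplus(z):
    # Numerically stable form of log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W x + b)."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng):
    """Random placeholder weights and zero biases (hypothetical initialization)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(x, layers):
    """Softplus hidden layers followed by a single logistic output unit."""
    h = x
    for weights, biases in layers[:-1]:
        h = dense(h, weights, biases, softplus)
    weights, biases = layers[-1]
    return dense(h, weights, biases, logistic)[0]


rng = random.Random(0)
layers = [init_layer(2, 5, rng), init_layer(5, 5, rng), init_layer(5, 1, rng)]
p = forward([0.3, -1.2], layers)  # standardized [batter K%], [pitcher K%]
print(0.0 < p < 1.0)  # True: the output is a probability
```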

Training of the network used the QRprop algorithm [17] (with standard settings [17]). This was found to provide better results than standard backpropagation (even with standard “tricks” [13] — though, such have to be considered carefully — see the following notes, especially in the context of ensembles).

The bias–variance tradeoff [18] was carefully considered (see again below the discussion about ensembles, and model averaging). Early stopping [19] was used to prevent overfitting [18].

Training data was subdivided, using the following partitioning scheme: 80% training and 20% testing (randomly selected); note that the former was further partitioned into 64% (actual) training (weight adjustment) and 16% for validation (early stopping — see above) (both relative to the total amount).
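The partitioning scheme can be sketched as a random split of indices. The function name and the seed are assumptions; only the 64%/16%/20% proportions come from the text.

```python
import random


def split_indices(n, seed=0):
    """Randomly split range(n) into 64% training, 16% validation, 20% testing."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = round(0.64 * n)
    n_val = round(0.16 * n)
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]
    return train, val, test


train, val, test = split_indices(1000)
print(len(train), len(val), len(test))  # 640 160 200
```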

The loss function was defined in terms of cross entropy (discussed below).

Note that the output logistic function and cross entropy error function are theoretically justified for classification problems. In particular, the logistic function is the cumulative distribution function of the logistic distribution; and cross entropy is the negative log likelihood of the Bernoulli distribution (i.e., related to the probability of the dataset, using the estimated parameters).

In further consideration of the bias–variance tradeoff, neural networks were combined in a weighted ensemble [20] (see below); consisting of 100 individual networks.

(Total) training data for each network was selected by bootstrapping [21]. In this way, each network is trained on a different sample that is inferred to be drawn from the population (rather than only the sample).

Note that it has been found [22] that ensemble-averaging results are improved by overtraining individual networks. (For the bias–variance tradeoff, this further reduces the bias, at the expense of higher variance; the latter is then reduced by the ensemble average.) For the averaging used herein (see below), however, early stopping (see above) gives better results. This may be because the individual networks are considered to form a hypothesis space, from which the most likely one is selected.

Following training, the weights of the ensemble members were adjusted using Bayesian model averaging (BMA) [23]. By this, the ensemble approximates the Bayes Optimal Classifier (an ensemble of all the hypotheses in the hypothesis space; on average, no other classification method using the same hypothesis space and same prior knowledge can outperform this method [24]). Consistent with the fact that this approach weights the individual networks by their likelihood given the data, the total dataset used for training was used for this adjustment.
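One common way to realize such a weighting is sketched below. It assumes a uniform prior over ensemble members, so that each weight is proportional to the member's likelihood on the data; this is an illustrative assumption, not necessarily the exact procedure used for the results herein. The function names and the example values are hypothetical.

```python
import math


def bma_weights(log_likelihoods):
    """Normalized posterior weights under a uniform prior over members."""
    top = max(log_likelihoods)
    # Subtracting the maximum avoids underflow; only relative values matter.
    unnormalized = [math.exp(ll - top) for ll in log_likelihoods]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def ensemble_prediction(member_probs, weights):
    """Weighted average of the members' predicted probabilities."""
    return sum(w * p for w, p in zip(weights, member_probs))


# Hypothetical log likelihoods: the second member is three times as likely.
weights = bma_weights([0.0, math.log(3.0)])
print([round(w, 2) for w in weights])  # [0.25, 0.75]
print(round(ensemble_prediction([0.20, 0.24], weights), 2))  # 0.23
```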

Note that model averaging was found to give better results than combination [25]. This is expected [23], as model averaging accounts for model uncertainty; which is high in this case of significant class label noise (see above).


The outcomes of all batter–pitcher matchups were considered as probability distributions.

Descriptive statistics (the first four central moments) were used to describe them.

Bootstrapping was used to estimate the uncertainties in these statistics, using 10,000 resamples.
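A minimal sketch of a bootstrap uncertainty estimate follows. The function name, seed, and example sample are assumptions; the example uses fewer resamples than the 10,000 quoted above, only to keep it quick.

```python
import random
import statistics


def bootstrap_std_error(sample, statistic, n_resamples=10_000, seed=0):
    """Standard deviation of a statistic over resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    estimates = [statistic(rng.choices(sample, k=n)) for _ in range(n_resamples)]
    return statistics.stdev(estimates)


# Hypothetical outcomes: 20 strikeouts in 100 matchups.
outcomes = [1] * 20 + [0] * 80
se = bootstrap_std_error(outcomes, statistics.fmean, n_resamples=2_000)
print(round(se, 3))
```

For the mean of binary outcomes, the result should be close to the analytical value sqrt(p(1 - p)/n), which is 0.04 for this example.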

The bias in each statistic (bootstrapped) was also estimated. These values were small; in fact, within the error bars of each quantity. They are therefore not reported (nor are bias-corrected statistics, since these are one further abstraction from the population parameters).

Evaluation was carried out using the following three measures, each of which provides different insight.

First: Foremost is the cross entropy,

(4)   \begin{equation*} C = -\frac{1}{N} \sum_\textbf{x} \left[ y \log\hat{y} + (1 - y) \log(1 - \hat{y}) \right] \end{equation*}

where the sum runs over the N inputs \textbf{x}, y is the one-hot representation of the label (\text{[K]}), and \hat{y} is the (probability) output (the latter quantities for a single input).

Cross entropy is a proper scoring rule (strictly proper) that measures dissimilarity between two probability distributions over the same underlying set of events; in this case, \lbrace y, 1 - y \rbrace and \lbrace \hat{y}, 1 - \hat{y} \rbrace.

Note that cross entropy is a rescaling of the gold standard optimization criterion (the log likelihood; see above); in a sense, it is therefore the best accuracy score to use.

Second: Another proper scoring rule (strictly proper) is the Brier score,

(5)   \begin{equation*} BS = \frac{1}{N} \sum_\textbf{x} (\hat{y} - y)^2 \end{equation*}

(for binary events).

This quantity corresponds to the square of the L^2 distance between the predicted and true label distributions.
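The two proper scoring rules, Eqs. (4) and (5), can be sketched as follows for binary labels. The function names, the clipping constant `eps`, and the example data are assumptions for illustration.

```python
import math


def cross_entropy(labels, probs, eps=1e-12):
    """Eq. (4): mean negative log likelihood of binary labels."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)


def brier_score(labels, probs):
    """Eq. (5): mean squared difference between prediction and label."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)


# Hypothetical labels with an uninformative 50% prediction for each.
labels = [1, 0, 0, 0]
probs = [0.5, 0.5, 0.5, 0.5]
print(round(cross_entropy(labels, probs), 4))  # 0.6931 (= ln 2)
print(brier_score(labels, probs))  # 0.25
```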

Bootstrapping was also used to estimate the uncertainties in these calculations.

Third: A final measure considered is the area under (AU) the receiver operating characteristic (ROC) curve (AUROC).

The AUROC was calculated by first threshold averaging [26] 100,000 (empirical) ROC curves, generated by bootstrapping, and using 10 thresholds. A “proper” binormal ROC curve [27] was then fit to the averaged one (and considering the error bars from averaging). The area under this curve was calculated by summing an analytical part [in terms of the cumulative distribution function of the normal distribution] and a numerical one [in terms of that of the standardized bivariate normal distribution].

This measure is not a proper scoring rule; but it does provide additional insight.

There are several equivalent interpretations of this measure. A common one, for example, is the expectation that a uniformly drawn random positive example is ranked before a negative one. This implies a measure of predictive discrimination.
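The ranking interpretation can be illustrated with a direct pairwise count. Note that this computes the empirical AUROC, which is a simpler stand-in for (not a reproduction of) the threshold-averaged binormal fit described above. The function name and the example data are hypothetical.

```python
def auroc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for sp in positives:
        for sn in negatives:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and predicted probabilities.
print(auroc([1, 0, 1, 0], [0.9, 0.1, 0.4, 0.6]))  # 0.75
```

A value of 0.5 corresponds to no discrimination, and 1.0 to every positive example being ranked above every negative one.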


Cross Validation

Optimization of parameters for the machine-learning model was carried out by k-fold cross validation, with k = 28; each subsample consists of a single season. Repeating the procedure k times, each subsample is used exactly once as the validation data.

In this case, data was organized in the following format:

    \begin{eqnarray*} & \text{[batter K\_1\%]} \text{[pitcher K\_1\%]} \text{[league K\_1\%]} \text{[K\_1]} \\ & \text{[batter K\_2\%]} \text{[pitcher K\_2\%]} \text{[league K\_2\%]} \text{[K\_2]} \\ & \vdots \\ & \text{[batter K\_N\%]} \text{[pitcher K\_N\%]} \text{[league K\_N\%]} \text{[K\_N]} ~~~ . \end{eqnarray*}

Cross entropy [Eq. (4)] was used as the single error measure to optimize.

The total error was then calculated as the average,

(6)   \begin{equation*} C = \frac{1}{k} \sum_{i = 1}^{k} C_i ~~~ , \end{equation*}

using the (standard) propagation of uncertainty.
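Eq. (6) and its uncertainty can be sketched as follows. The sketch assumes that the per-season uncertainties are independent, so that they add in quadrature; the function name and the example values are hypothetical.

```python
import math


def mean_with_uncertainty(values, sigmas):
    """Eq. (6), with independent uncertainties propagated in quadrature."""
    k = len(values)
    mean = sum(values) / k
    sigma = math.sqrt(sum(s * s for s in sigmas)) / k
    return mean, sigma


# Hypothetical per-season errors and their uncertainties.
mean, sigma = mean_with_uncertainty([0.50, 0.52], [0.03, 0.04])
print(round(mean, 3), round(sigma, 3))  # 0.51 0.025
```

This is why the uncertainty on the total error is much lower than that of any single season.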

Errors for each of the subsamples (by year) are shown in the following table:

Year | log5 | Machine Learning (if different)

Note that machine-learning results are reported only where they differ from the log5 ones.

It can be seen that there is no discernible (i.e., to within uncertainties) difference between the results; in fact, most seasons are precisely the same, including uncertainties.

A trend is seen in these results; the predictive accuracy has decreased steadily and significantly season-over-season.

The total error [Eq. (6)] is shown in the following table:

log5 | Machine Learning

As expected (based on the results above), there is no discernible difference, even with the much lower uncertainty.

Nonetheless, the accuracy remains higher than naive predictions based only on batter, pitcher, or league averages; as shown in the following table:


Note that the former quantity gives the lowest error out of the three. This is consistent with a recent study [6] that found that batters control the majority of the variance in predicted strikeout rate.

Individual Seasons

Data organized in the above format, however, obscures the information contained in it. Consider two different batter–pitcher combinations from different seasons with the same league K%. The above organization cannot resolve this; and hence, information about the batters and pitchers (over which K% was calculated for their statistics) and their outcome is obscured.

Consider instead the following format:

    \begin{eqnarray*} & \text{[batter K\_1\%]} \text{[pitcher K\_1\%]} \text{[K\_1]} \\ & \text{[batter K\_2\%]} \text{[pitcher K\_2\%]} \text{[K\_2]} \\ & \vdots \\ & \text{[batter K\_N\%]} \text{[pitcher K\_N\%]} \text{[K\_N]} \end{eqnarray*}

where the N data points run over all batter–pitcher combinations for which the K% statistics were calculated (i.e., a single season). This organization in fact contains more (at least, resolved) information: the distribution of K% (for batters and pitchers) is now described; while league K% (total) information is no longer described, it is constant, and does not affect the results; the average K% is now described (by the output); etc.


As mentioned above, this problem can be considered one in which there is considerable class label noise.

It is therefore first verified that machine learning can be practically applied to this problem.

In order to verify several of the above results, the following two points were considered:

First tested is its ability to learn the underlying probability distribution that generates the class labels ([\text{K}]).

This is done by using the log5 method as a source of precisely known probabilities. This distribution can be used to generate (an infinite amount of) training data, as follows. Probabilities p are first specified as the target output,

    \begin{equation*} [p_\text{log5}] ~~~ , \end{equation*}

instead of Eq. (2). Rather than train directly on these probabilities, class labels [Eq. (2)] are randomly generated according to them at each epoch during training.

Note that training this way mimics that for the actual data, keeping everything consistent (in practice) (types of data encountered, loss function, etc.).
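The per-epoch label generation can be sketched as follows. The function name, seed, and probabilities are hypothetical; the training step itself is omitted.

```python
import random


def sample_labels(probabilities, rng):
    """Draw one Bernoulli class label (0 or 1) per target probability."""
    return [1 if rng.random() < p else 0 for p in probabilities]


rng = random.Random(0)
p_log5 = [0.10, 0.22, 0.35]  # hypothetical log5 probabilities
for epoch in range(3):
    labels = sample_labels(p_log5, rng)  # fresh labels at every epoch
    # ... one training epoch on (inputs, labels) would go here ...
    print(epoch, labels)
```

Because the labels are redrawn at each epoch, the network never sees the probabilities themselves, only noisy outcomes consistent with them.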

For BMA, Brier scores (technically, in this case, mean squared errors) can use probabilities directly for hypothesis selection. Note that it is easy to show that this is correct for an infinite set of data. While

    \begin{equation*} \left( \hat{y} - p \right)^2 \neq \lim_{N \rightarrow \infty} \frac{1}{N} \sum_{i = 1}^N \left( \hat{y} - \lbrace 0,1 \rbrace_i \right)^2 = p \left( \hat{y} - 1 \right)^2 + \left( 1 - p \right) \left( \hat{y} - 0 \right)^2 \end{equation*}

where \lbrace 0,1 \rbrace is the one-hot representation generated by p, the minima of the left- and right-hand sides occur for the same values of p and \hat{y}; moreover, the shapes of the functions for p \neq \hat{y} are the same. Since log likelihoods are defined relative to a maximum value, the two results are identical.
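A short worked expansion makes this explicit (a sketch using only the symbols defined above). Expanding the right-hand side gives

    \begin{equation*} p \left( \hat{y} - 1 \right)^2 + \left( 1 - p \right) \hat{y}^2 = \hat{y}^2 - 2 p \hat{y} + p = \left( \hat{y} - p \right)^2 + p \left( 1 - p \right) ~~~ , \end{equation*}

so the two sides differ only by the term p(1 - p), which does not depend on \hat{y}. Both are therefore minimized at \hat{y} = p, and differences between hypotheses evaluated on the same data are unchanged.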

Consider, for example, results for 2017.

Empirical probability density functions (PDFs) (plotted as histograms) and cumulative distribution functions (CDFs) are shown in the following figure, for both the log5 and machine-learning (denoted in all following figures as ML) approaches:







Note that the \text{P(K)} axes have been truncated at 0.6 for clarity, as less than 0.06% of the data (log5 estimate) occurs above this.

The PDFs show qualitatively (visually) that the distributions are nearly identical.

Quantitatively, corresponding to the PDFs are descriptive statistics shown in the following table:

Method | Mean | Standard Deviation | Skewness | Kurtosis
Machine Learning | 0.2136(5) | 0.0854(4) | 0.85(2) | 0.88(7)

There is no quantitative difference that can be resolved between the distributions.

Not considering the uncertainties, for the sake of discussion, the quantitative differences can be understood. For example, the average strikeout rate for the test data is 0.2133 (which the log5 method correctly predicts); while that for the training data for machine learning is 0.2136 (which it therefore understandably assumes for the test data).

Other minor qualitative differences can therefore be attributed to, and illustrate, the levels of uncertainties.

In order to (statistically) determine whether the two distributions are different, the two-sample Kolmogorov–Smirnov (K–S) test [28,29] was performed. This is used to test whether the underlying probability distributions (of the two samples) differ.
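The K–S statistic D is the largest absolute difference between the two empirical CDFs. A minimal sketch follows; the function name and sample values are hypothetical, and the p-value calculation is not shown.

```python
import bisect


def ks_statistic(sample_a, sample_b):
    """Two-sample K-S statistic: maximum gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    # The gap can only change at observed values, so checking those suffices.
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Hypothetical predicted probabilities from two methods.
print(ks_statistic([0.1, 0.2, 0.3, 0.4], [0.25, 0.35, 0.45, 0.55]))  # 0.5
```

Identical samples give D = 0, and samples with no overlap give D = 1.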

For the data above, the K–S statistic D and p-value are shown in the following table:


The null hypothesis of the K–S test (i.e., that the two samples are drawn from the same distribution) certainly cannot be rejected. The samples are therefore consistent (very much so, given the p-value) with coming from the same underlying probability distribution.

Shown also for reference below is the difference between CDFs.

While there is no difference near the average \text{P(K)} [at 0.2135(4) — average from the above table], away from this there are noticeable ones. Above \text{P(K)} \approx 0.15, there is an antisymmetry about \overline{\text{P(K)}}, in that the log5 results underestimate below it, and overestimate above. It is not precisely antisymmetric though, in that the results above are skewed towards higher results.

Important, however, is the peak below \text{P(K)} \approx 0.15. This does not follow the trend. This makes interpretation of these results difficult. The high p-value of the K–S test suggests that this may be mostly “noise”. In this context (and below), this is meant to imply any numerical bias, uncertainty, or related effect(s). In this way, this peak sets the “noise” scale.

It can therefore be concluded that machine learning is capable of determining a known underlying probability distribution.

The machine-learning and log5 approaches were also compared more directly by looking at the difference between predictions.

These results are shown in the following table:

Method | Mean | Standard Deviation | Skewness | Kurtosis
(log5 - ML) | -0.00028(1) | 0.00201(4) | -2.6(4) | 43(5)
(log5 - ML)^2 | 0.0000041(1) | 0.000027(2) | 28(2) | 1000(100)

The average difference between the predictions is remarkably small; the root mean square is only 0.0020(4), compared, for example, to the uncertainty between the means, 0.0007. Such agreement is not unexpected, given the similarity between the distributions (discussed above).

Notably different, however, are the skewness and kurtosis of the results. Relative to each other, the machine-learning results are skewed towards higher values; and, the differences form a curve that is leptokurtic (slender near the mean, with fat tails). A plausible explanation for this is the behavior of the loss function used for machine learning [Eq. (4)]. This harshly penalizes predictions in the wrong direction away from 0.5. This could lead to both a skew towards higher values (as seen in the skewness) and a flattening out (as seen in the kurtosis).

2015, 2016, and 2017 Seasons

Machine learning was applied to the three most recent (complete) seasons (2015, 2016, and 2017).

The following table shows the three evaluation metrics for the above data:


By all measures, this and the log5 approaches provide equally accurate (and uncertain) and discriminating models for the data; therefore, only a single table is reported.

The decrease in predictive accuracy (both measures, in this case) and discrimination for more recent seasons is again apparent.

Close inspection of the results, however, reveals some subtle differences.

PDFs and CDFs are shown in the following figures:














These figures show qualitatively that the distributions are similar; though, with differences compared to training against the log5 distribution (see above).

Consider first the descriptive statistics for the PDFs; shown in the following tables:


Method | Mean | Standard Deviation | Skewness | Kurtosis
Machine Learning | 0.2138(5) | 0.0868(4) | 0.84(2) | 0.85(7)


Method | Mean | Standard Deviation | Skewness | Kurtosis
Machine Learning | 0.2094(4) | 0.0826(4) | 0.79(2) | 0.78(8)


Method | Mean | Standard Deviation | Skewness | Kurtosis
Machine Learning | 0.1988(4) | 0.0816(4) | 0.81(2) | 1.02(8)

The results quantitatively suggest that the distributions are similar. There are some very slight differences though, in these cases, outside of the uncertainties.

With only three seasons' worth of data, it is difficult to reasonably determine trend(s) in this data; though, in each case, the machine-learning standard deviation is higher, the skewness remains within error bars, and the kurtosis is lower.

The two distributions were compared, using the K–S test. The results are shown in the following table:


The results are relatively inconclusive. While the null hypothesis (i.e., that the two samples are drawn from the same distribution) cannot be rejected for 2017, this can be stated with less confidence for the other two seasons. Even for 2017, while the data are consistent with the two methods coming from the same underlying probability distribution, the agreement is much weaker than when trained against the log5 probability distribution (see above).

Note also that while there appears to be a trend in these results (increasing in recent years), this cannot be conclusively stated; further testing would be needed.

Indeed, a close examination of the PDFs and CDFs suggests some subtle differences.

Shown in the following figure is the difference between the log5 and machine-learning results:

The qualitative trends are similar to when trained on the log5 data (above).

Quantitatively, however, the differences are much more significant.

Consider first the similar peak that occurs at low \text{P(K)}. That this has the same magnitude as above, and that its appearance is hard to justify even qualitatively, supports its use as setting a “noise” scale.

With this scale set, the differences about \overline{\text{P(K)}} can be interpreted as significant. Indeed, comparing the corresponding D and p-values to before shows a significant difference.

Therefore, the qualitative trends may simply be coincidental.

It is important though to check further whether any additional “noise” (see above) may be introduced into the calculations when going from known (log5) to real data.

This is done by examining the differences between the predictions of the two approaches.

These are shown in the following table:

Method | Mean | Standard Deviation | Skewness | Kurtosis
(log5 - ML) | -0.00041(2) | 0.00337(4) | -2.7(2) | 21(4)
(log5 - ML)^2 | 0.0000115(3) | 0.000056(6) | 30(4) | 1400(400)

These results are similar to when trained on the log5 probabilities (see above).

Not unexpectedly, however, the differences are higher.

The average difference between the machine-learned and log5 predictions, though, remains remarkably small; the root mean square is only 0.0034(5) (compared to the above). The skewness and kurtosis results [still for (\text{log5} - \text{ML})^2] are within error bars of the previous results.
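The quoted root mean square follows directly from the tabulated mean of (\text{log5} - \text{ML})^2, and the two rows of the table can be checked against each other, as in this short sketch. The variable names are illustrative, and the consistency check assumes the standard deviation is the population form, so that the mean square equals the variance plus the squared mean.

```python
import math

# Central values taken from the table of differences above.
mean_diff = -0.00041       # mean of (log5 - ML)
std_diff = 0.00337         # standard deviation of (log5 - ML)
mean_sq_diff = 0.0000115   # mean of (log5 - ML)^2

# Root mean square difference between the two sets of predictions.
rms = math.sqrt(mean_sq_diff)

# Consistency check: E[x^2] = Var[x] + (E[x])^2.
mean_sq_from_moments = std_diff ** 2 + mean_diff ** 2

print(rms, mean_sq_from_moments)
```

The two estimates of the mean squared difference agree to within the quoted uncertainty.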

These results in total suggest that no additional “noise” is introduced with real data, and that the above results are significant.


Assumption of Uniformity

Before discussing these results, it is important to consider the tacit assumption of uniformity.

In this context, this is taken to mean that the outcome of a batter–pitcher matchup depends only on their respective statistics, calculated directly and without adjustment [Eq. (3)].

This therefore makes simplifying assumptions about the following quantities:

  • handedness
  • park effects
  • seasonal (yearly) calculation of data
  • league separation

(in approximate order of importance).
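To make the assumption concrete, the sketch below uses the commonly quoted odds-ratio form of the log5 matchup formula. It is offered as an assumption, not as a reproduction of Eq. (3), which takes precedence wherever the two differ. The function takes only three unadjusted rates, so none of the quantities listed above can influence its output. The function name and the input values are hypothetical.

```python
def log5_matchup(batter_rate, pitcher_rate, league_rate):
    """Matchup probability from three unadjusted rates.

    Assumption: the commonly quoted odds-ratio form of log5 is used.
    No handedness, park, season, or league adjustment enters.
    """
    for rate in (batter_rate, pitcher_rate, league_rate):
        if not 0.0 < rate < 1.0:
            raise ValueError("rates must lie strictly between 0 and 1")
    event = batter_rate * pitcher_rate / league_rate
    no_event = (1.0 - batter_rate) * (1.0 - pitcher_rate) / (1.0 - league_rate)
    return event / (event + no_event)


# Hypothetical strikeout rates, for illustration only.
print(log5_matchup(batter_rate=0.25, pitcher_rate=0.30, league_rate=0.21))
```

In this form, a league-average pitcher leaves the batter's own rate unchanged, and vice versa.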

Consider these assumptions in the context of strikeout rate. The handedness of batters and pitchers (both individually and relative to each other) plays a role. There are also some park effects. K% also varies from season to season (increasing over the last several seasons). The National League, for example, has a higher K% than the American League (even with pitchers discounted).

The purpose of this work, however, was not to provide or discuss a comprehensive, fully adjusted and optimal prediction of this outcome. Rather, it was to answer the questions posed in the Introduction.

These assumptions therefore are valid, for the intended purpose.

It is important though to consider the impact of these assumptions on the results presented.

It seems reasonable to (further) assume that the effect of these assumptions can be considered as noise introduced into the data. Note that “noise” in this context is in the data, unlike that discussed above. There is no a priori reason though to suspect that this would affect the results, other than to simply increase the lower bound on classification error.


Machine learning was used to determine strikeout rate.

Such an approach, in principle, can determine this relationship (numerically) exactly. This was demonstrated to be (essentially) achievable, by training against the (known) log5 probability distribution. This revealed subtle bias and uncertainties expected to be observed in practice.

Results over the last 28 seasons reveal a surprising trend: the predictive accuracy has steadily and significantly decreased season-over-season. The number of seasons considered suggests that this trend is significant. This may also be the case for predictive discrimination, but more seasons would need to be considered to state this with significance. This subject will be considered in a future article.

A comparison against the log5 method shows that it is equally accurate, to within uncertainties. This was the case both for results over the last 28 seasons and for a more detailed comparison (with additional measures) over the last three.

A close examination of the underlying probability distributions reveals additional information.

The results exhibit a noticeable skew towards higher probabilities. With respect to this, the log5 results do not appear to be biased in the way recently reported [30] (i.e., the skewness is significant).

The probability distributions of the machine-learning and log5 results are, however, different.

One plausible explanation could be that the additional “noise” in the data causes the machine-learning results to be conservative (i.e., predict results closer to the league average, on average).

However, an empirical analysis of batter–pitcher matchup data (to be presented in a future article) shows that this bias is real.

This in fact can be seen, but apparently went unnoticed, in an earlier study [5]. This bias has also been noted in an application [31] of the original log5 formula to team wins (see above). Note that bias issues in team wins, though, can often be rationalized in several ways (for example, teams don’t play themselves). With the significantly higher number of batter–pitcher matchups, not all of the same biases exist.

This difference is slight, however; and it would therefore have only a minor effect on many calculations.

This difference is important though, as discussed below.


Strikeout rate (K%) for batter–pitcher matchups was determined (numerically) exactly.

This was done using machine learning.

These results also showed a steady and significant season-over-season decrease in predictive accuracy and, possibly, in discrimination.

The analytical log5 method was assessed by comparing to these results.

This method was found to be equally accurate, within uncertainties.

Considering the three most recent seasons in more detail revealed additional information about the underlying probability distributions.

The log5 method appears to be biased, but not in the way previously reported. The method appears to underpredict below the league average and overpredict above it.

This difference is important for a deeper understanding of batter–pitcher matchups.

It will therefore be an important guide to future theoretical work.

It is also important for providing extremely high-quality numerical results; and so it will be important to the most detailed quantitative calculations (e.g., _simulator, by statshacker).


[1] R. A. Bradley and M. E. Terry, “RANK ANALYSIS OF INCOMPLETE BLOCK DESIGNS. I. THE METHOD OF PAIRED COMPARISONS,” Biometrika 39, 324–345 (1952)

[2] B. James, “More Log5 Stuff,” BILL JAMES ONLINE [online] (2015)

[3] B. James, “Pythagoras and the Logarithms,” Baseball Abstract, pp. 104–110 (1981)

[4] B. James, “Log5 Method,” The Bill James Baseball Abstract, pp. 12–13 (1983)

[5] D. Levitt, “The Batter/Pitcher Matchup,” By the Numbers 9, 18–20 (1999) PDF [online]

[6] G. Healey, “Modeling the Probability of a Strikeout for a Batter/Pitcher Matchup,” IEEE T. Knowl. Data En. 27, 2415–2423 (2015)

[7] A. Samuel, “Some Studies in Machine Learning Using the Game of Checkers,” IBM Journal of Research and Development 3, 210–229 (1959)

[8] K. Hornik, “Approximation Capabilities of Multilayer Feedforward Networks,” Neural Networks 4, 251–257 (1991)

[9] K. Koseler and M. Stephan, “Machine Learning Applications in Baseball: A Systematic Literature Review,” Applied Artificial Intelligence 31, 745–763 (2017) PDF

[10] CS229: Machine Learning. Accessed: 2018-06-17

[11] M. A. Alcorn, “(batter|pitcher)2vec: Statistic-Free Talent Modeling With Neural Player Embeddings,” MIT Sloan Sports Analytics Conference (2018) PDF

[12] R. A. Carleton, “Baseball Therapy: It’s a Small Sample Size After All,” Baseball Prospectus [online] (2012)

[13] Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller, “Efficient BackProp,” in Neural Networks: Tricks of the Trade: Second Edition, pp. 9–48 (Springer Berlin Heidelberg, 2012) PDF

[14] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature 521, 436 (2015)

[15] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural Networks 61, 85–117 (2015)

[16] C. Dugas, Y. Bengio, F. Bélisle, C. Nadeau, and R. Garcia, “Incorporating Second-Order Functional Knowledge for Better Option Pricing,” Proceedings of the 13th International Conference on Neural Information Processing Systems, 451–457 (2000) PDF

[17] M. Pfister and R. Rojas, “Hybrid Learning Algorithms for Neural Networks — The adaptive Inclusion of Second Order Information,” [technical report] (1996) PDF

[18] S. Geman, E. Bienenstock, and R. Doursat, “Neural Networks and the Bias/Variance Dilemma,” Neural Computation 4, 1–58 (1992) PDF

[19] N. Morgan and H. Bourlard, “Generalization and Parameter Estimation in Feedforward Nets: Some Experiments,” Ed. D. S. Touretzky, Advances in Neural Information Processing Systems 2, pp. 630–637 (Morgan-Kaufmann, San Mateo, CA, 1990) PDF

[20] M. P. Perrone and L. N. Cooper, “When Networks Disagree: Ensemble Methods for Hybrid Neural Networks,” Ed. R. J. Mammone, Artificial Neural Networks for Speech and Vision, pp. 126–142 (Chapman and Hall, 1993) PDF

[21] B. Efron, “Bootstrap methods: Another look at the jackknife,” The Annals of Statistics 7, 1–26 (1979)

[22] U. Naftaly, N. Intrator, and D. Horn, “Optimal Ensemble Averaging of Neural Networks,” Network: Computation in Neural Systems 8, 283–296 (1997) PDF

[23] J. A. Hoeting, D. Madigan, A. E. Raftery, and C. T. Volinsky, “Bayesian Model Averaging: A Tutorial,” Statistical Science 14, 382–417 (1999) PDF

[24] T. M. Mitchell, Machine Learning, p. 175 (McGraw-Hill, Inc., 1997)

[25] K. Monteith, J. L. Carroll, K. Seppi, and T. Martinez, “Turning Bayesian Model Averaging Into Bayesian Model Combination,” The 2011 International Joint Conference on Neural Networks, 2657–2663 (2011) PDF

[26] T. Fawcett, “An introduction to ROC analysis,” Pattern Recognition Letters 27, 861–874 (2006)

[27] C. E. Metz and X. Pan, ““Proper” Binormal ROC Curves: Theory and Maximum-Likelihood Estimation,” J. Math. Psych. 43, 1–33 (1999)

[28] A. Kolmogorov, “Sulla determinazione empirica di una legge di distribuzione,” G. Inst. Ital. Attuari. 4, 83–91 (1933)

[29] N. V. Smirnov, “On the estimation of the discrepancy between empirical curves of distributions for two independent samples,” Bulletin mathematique de l’Universite de Moscou 2, 2 (1939)

[30] L. C. Morey and M. A. Cohen, “Bias in the log5 estimation of outcome of batter/pitcher matchups, and an alternative,” J. Sports Analytics 1, 65–76 (2015)

[31] R. Ciccolella, “Log5 — Derivations and Tests,” By the Numbers 14, 5–12 (2004) PDF


About Author

statshacker is an Assistant Professor of Physics and Astronomy at a well-known state university. His research interests involve the development and application of concepts and techniques from the emerging field of data science to study large data sets. Outside of academic research, he is particularly interested in such data sets that arise in sports and finance. Contact:
