This Communication describes the MLB prediction model **MLBp v0.1**.

**status**: alpha

The data mining process at **statshacker** is described here.

Described below are the relevant steps of this process, as applied to this model. Note that (for obvious reasons) only general, high-level details are described; precise ones are not made public.

## “Business” Understanding

The objective of this model is to provide a (probabilistic) classification of the outcome of an MLB game.

## Data Preparation

The (final) dataset consists of parameters selected from the following categories of statistics:

- Batting
- Pitching
- Fielding
- Team fielding
- League information
- Team performance

Prior to modeling, the data is pre-processed.

## Modeling

This model approximates an optimal classifier, using sophisticated machine-learning and statistical methods.

## Evaluation

Evaluation is performed by testing on data for years 2010–2017.

Training and testing are done by walk-forward analysis [1], the appropriate evaluation and optimization method for this problem.
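The general shape of a walk-forward evaluation can be sketched as follows. This is a minimal illustration and not the model's actual procedure: the function name `walk_forward_splits`, the expanding training window, and the 2009 starting season are all assumptions made for the example.

```python
from typing import Iterator, List, Tuple


def walk_forward_splits(
    seasons: List[int], min_train: int = 1
) -> Iterator[Tuple[List[int], int]]:
    """Yield (training seasons, test season) pairs in chronological order.

    Assumption: an expanding (anchored) training window. A fixed-length
    rolling window is an equally common variant.
    """
    ordered = sorted(seasons)
    for i in range(min_train, len(ordered)):
        # Train only on seasons strictly earlier than the test season.
        yield ordered[:i], ordered[i]


# Hypothetical season range: 2009 is included only so that the first
# test year (2010) has at least one earlier season to train on.
splits = list(walk_forward_splits(list(range(2009, 2018))))
for train, test in splits:
    print(f"train on {train[0]}-{train[-1]}, test on {test}")
```

Each test season is predicted using only earlier seasons, so no information from the future leaks into training.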

Evaluation metrics are described in the remainder of this section.

### Discriminatory Ability

Representative receiver operating characteristic (ROC) curves for the prior two years (2016 and 2017) are shown in the following figure (left and right, respectively):

Note that, coincidentally, these two years lie roughly one standard deviation from the average: 2017 on the low side and 2016 on the high side (see the table below).
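For readers unfamiliar with ROC curves, the sketch below shows how the points of such a curve can be traced from predicted probabilities and observed outcomes. It is an illustration only: `roc_points` is a hypothetical helper, and the labels and scores are invented toy data, not model output.

```python
from typing import List, Sequence, Tuple


def roc_points(
    labels: Sequence[int], scores: Sequence[float]
) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) pairs.

    Each distinct score is used as a decision threshold, from the
    strictest (highest) to the most lenient (lowest).
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Toy data (made up): 1 = the predicted outcome occurred, 0 = it did not.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.62, 0.48, 0.55, 0.51, 0.53, 0.45]
points = roc_points(labels, scores)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

A classifier with no discriminatory ability follows the diagonal from (0, 0) to (1, 1); the further the curve bows toward the upper-left corner, the better the discrimination.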

### Discriminatory Accuracy

The area under the ROC curve (AUC) for each test year is reported in the following table:

Year | AUC |
---|---|
2017 | 0.5603 |
2016 | 0.6227 |
2015 | 0.5489 |
2014 | 0.5632 |
2013 | 0.6101 |
2012 | 0.5814 |
2011 | 0.6226 |
2010 | 0.5978 |

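The average AUC is 0.59(3), where the digit in parentheses is the standard deviation in the last reported digit (that is, 0.59 ± 0.03).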
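The summary figure can be checked directly from the per-year values in the table. The snippet below is a small sketch using Python's standard library; whether the original figure uses the sample or the population standard deviation is not stated, but both round to 0.03 here.

```python
import statistics

# Per-year AUC values copied from the table above (2017 down to 2010).
aucs = [0.5603, 0.6227, 0.5489, 0.5632, 0.6101, 0.5814, 0.6226, 0.5978]

mean_auc = statistics.mean(aucs)
sample_sd = statistics.stdev(aucs)       # divides by n - 1
population_sd = statistics.pstdev(aucs)  # divides by n

print(f"mean = {mean_auc:.4f}")
print(f"sample sd = {sample_sd:.4f}, population sd = {population_sd:.4f}")
```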

### Prediction Accuracy

Brier scores are reported in the following table:

Year | Brier Score |
---|---|
2017 | 0.2469 |
2016 | 0.2400 |
2015 | 0.2499 |
2014 | 0.2467 |
2013 | 0.2410 |
2012 | 0.2443 |
2011 | 0.2430 |
2010 | 0.2427 |

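The average Brier score is 0.244(3), where the digit in parentheses is the standard deviation in the last reported digit (that is, 0.244 ± 0.003).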
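For context, the Brier score of a binary forecast is the mean squared difference between the predicted probability and the observed outcome (0 or 1), so lower is better. The sketch below illustrates the calculation; `brier_score` is a hypothetical helper and the inputs are invented toy values, not model output.

```python
from typing import Sequence


def brier_score(outcomes: Sequence[int], probabilities: Sequence[float]) -> float:
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(outcomes) != len(probabilities) or not outcomes:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum((p - o) ** 2 for o, p in zip(outcomes, probabilities))
    return total / len(outcomes)


# Toy example (made-up values).
toy = brier_score([1, 0, 1, 1], [0.60, 0.40, 0.55, 0.45])

# Reference point: always predicting 0.5 scores exactly 0.25,
# whatever the outcomes are.
coin_flip = brier_score([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5])

print(f"toy = {toy:.5f}, coin flip = {coin_flip:.2f}")
```

Against that reference, every yearly score in the table falls slightly below the 0.25 obtained by an uninformative 50/50 forecast.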

## References

[1] R. Pardo. Design, Testing, and Optimization of Trading Systems (John Wiley & Sons, 1992). Expanded and updated edition.