Data Mining Process


Data mining is the process of discovering patterns in large data sets, using methods at the intersection of machine learning, statistics, and database systems. This is a (if not the) primary effort at statshacker.

In order to standardize this effort (for which there are several reasons — efficiency, transparency in results reported, etc.), the official data-science framework, the Cross Industry Standard Process for Data Mining (CRISP-DM) [1], is followed.

This framework consists of six steps. These are detailed below, in the context of efforts here toward sports analytics (in particular, baseball).

Note that the following tools are (primarily) used: the open-source statistical software package stats++, the relational database management system PostgreSQL, and DB++, a C++ interface to this system.


“Business” Understanding

This initial phase focuses on understanding the project objectives and requirements (technically, from a business perspective), then converting this knowledge into a data-science problem definition.

The objectives and requirements are project dependent [e.g., at statshacker: determining the (exact) run–win relationship, or the (pre-)analysis of the outcome of MLB games]. The only common requirement is that the necessary data be readily available (see below).

The data-science problem is thus (in some sense) self-defined by a single question: How can the question(s) of the project be answered by modeling?

Data Understanding

This phase starts with initial data collection, and proceeds with activities that enable one to become familiar with the data.

At statshacker: A significant amount of the data considered is publicly available, such as event data from MLB Gameday, play-by-play data from Retrosheet, up-to-date statistics reported at, and (in some cases) additional information from the Lahman database. Data sources and available information will be discussed in more detail in a separate article.

Data Preparation

This phase covers all activities needed to construct the final dataset from the initial raw data.

This phase consists of two steps:

First: The data is (usually) processed to a form suitable for mining, but which preserves the information content. At statshacker: This data is then (often) structured and stored in a relational database, with consideration of database normalization.

Note that, to this point, the steps reported are common to most data mining efforts.

Second: Data is selected for modeling. This depends on the “business” understanding (see above).

In addition, prior to feeding into any modeling tool(s), the data is often also pre-processed (as discussed below).


Modeling

In this phase, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values.

Much as for the data preparation (see above), the technique(s) selected are based on the “business” understanding. At statshacker: These range from analytical models to the application of machine learning.

Note that some techniques have specific requirements on the form of the data. This is (often) handled by pre-processing, and therefore going back to the data preparation phase is (often) necessary.

Note also that a subtle but important consideration (technically part of the data preparation phase) is that the same set of data be used among different techniques (though it may be used in different ways). Only in this way can meaningful comparisons between them be made.


Evaluation

In this phase, the model(s) developed are thoroughly evaluated, and the steps executed to do so are reviewed, to be certain that they properly achieve the “business” objectives.

This (often) involves testing the model.

Note that it is (extremely) important to test on data that was not used for modeling. There are several ways to do this, with the most appropriate (often) dependent on the context; examples include k-fold cross-validation and walk-forward analysis.

At the end of this phase, a decision on the use of the data-science results should be reached.


Deployment

In this final phase, the knowledge gained is organized and presented in a way that is useful.

At statshacker: This is reported in posts, or deployed as “live” prediction models. Note that the end user also carries out the deployment effort, deciding how to use the results.

Note that this phase also covers the implementation of this standardized data-science process across statshacker.


[1] CRISP-DM 1.0


About Author

statshacker is an Assistant Professor of Physics and Astronomy at a well-known state university. His research interests involve the development and application of concepts and techniques from the emerging field of data science to study large data sets. Outside of academic research, he is particularly interested in such data sets that arise in sports and finance. Contact: