Baseball parks can have a significant effect on the events that occur in them.

Park factors can be used to quantify these effects, in order to understand the context of data that have been collected.

There is considerable confusion and disagreement, though, about how they should be defined, calculated, and applied.

Introduction

As of 1998, there have been 30 Major League Baseball (MLB) teams (see here).

Each of them plays in different baseball parks (herein referred to simply as “parks”), or at least plays in each park to a different extent.

Parks can have a significant effect on the events that occur in them. Common examples include the significantly increased carry of fly balls at Coors Field (which, for example, increases home runs) and the increased prevalence of doubles at Fenway Park.

In order to meaningfully consider data for a team, player, etc. (for example, to calculate value or ability), it is important to understand the context (in this case, the park) in which the data were collected.

Park factors can be used to quantify these effects.

There is considerable confusion and disagreement, though, about how they should be defined, calculated, and applied.

Careful consideration reveals that many of these issues are related, and arise only from a simple misunderstanding of particular aspects of park factors.

The disagreement follows from the approach to their calculation (which differs significantly between sources), while the confusion follows from their misinterpretation.

This article is organized as follows. Park factors are first defined. General considerations of park factors are then discussed, followed by considerations specific to their calculation (after discussion of a necessary assumption). Additional considerations conclude this part. Results are then presented (following the methods used for their calculation), which show the merit of park factors. A discussion and conclusions follow.

Definition

In order to consider park factors in detail, it is necessary to first provide a fundamental (theoretically well-defined) definition, as follows:

The park factor $PF_p$ for park $p$ is that which converts between a quantity $Q_p$ calculated in the park and the park-neutral (subscript $0$) one $Q_0$.

Note that “park-neutral” should be considered synonymous with “league-average”.

There are several ways in which this conversion can take place.

Example: Multiplicative Park Factors

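The most common method as applied to statistics is multiplicative; that is, the quantity $Q_p$ in park $p$ is the park-neutral quantity $Q_0$ multiplied by the park factor $PF_p$,

$$Q_p = PF_p \, Q_0 \tag{1}$$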

Note that, despite their common application, careful thought suggests that this is not theoretically correct.

Alternative approach(es) (and further consideration of multiplicative park factors) will be discussed in future article(s).
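As a minimal sketch of how a multiplicative park factor might be applied in practice, the following Python snippet converts a statistic between a park context and a park-neutral one. The function names, the home run rate, and the park factor value are all hypothetical, chosen only for illustration.

```python
def to_neutral(stat_in_park: float, park_factor: float) -> float:
    """Convert a statistic observed in a park to a park-neutral value."""
    if park_factor <= 0:
        raise ValueError("park_factor must be positive")
    return stat_in_park / park_factor


def to_park(stat_neutral: float, park_factor: float) -> float:
    """Convert a park-neutral statistic to the context of a park."""
    return stat_neutral * park_factor


# Hypothetical inputs: home runs per plate appearance and a park factor.
hr_rate_in_park = 0.036
pf = 1.20

neutral = to_neutral(hr_rate_in_park, pf)
round_trip = to_park(neutral, pf)
print(neutral, round_trip)
```

The two functions are inverses of each other, which is the sense in which a multiplicative park factor "converts between" the two contexts.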

Considerations

Several considerations must be made. They are discussed below in the context of park factors in general.

Static vs. Dynamic Conditions

Park factors are affected by conditions that can be categorized into static or dynamic.

Static Conditions

Static conditions are those that do not change from season to season. An example is the dimensions of the park (assuming that no changes have been made).

Dynamic Conditions

Dynamic conditions are those that may change from season to season.

These can be further categorized into predictable and unpredictable.

Predictable conditions are those that may change, but with an effect that can be predicted. An example is the cut of the grass, given the same groundskeeper.

Depending on their source, these conditions are perhaps better treated as an additional correction.

Unpredictable dynamic conditions cannot be treated as an additional correction. These may include effects that are too complex to predict, (essentially) random, etc.

Some conditions have characteristics that (practically) fall into both categories (and to varying extents). One particularly significant example is the weather.

Dynamic conditions are important to consider in the context of calculation duration.

The “Quantity”

Several (if not all) quantities are affected by the park (at least to some extent).

Statistics are often of interest, and are the focus below.

Run park factors are most commonly used.

Parks do not affect all events to the same extent, however. In order to understand this context, and to determine a “complete” park-neutral one, the effect on each event should be considered individually, giving component park factors.

In the following, a general statistic will be denoted by $S$.

Normalization of Statistics

An important consideration for statistics is “normalization”.

That is, a statistic $S$ is expressed as a rate,

$$S = \frac{n}{N} \tag{2}$$

where $n$ is the number of occurrences and $N$ is the (normalization) total number of opportunities.

(Using $n$ directly would introduce a bias.)
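To illustrate the bias, consider the following sketch with made-up counts. Raw home run totals suggest that the home park suppresses home runs, whereas the per-opportunity rates suggest the opposite, simply because there were fewer opportunities at home. All names and numbers here are hypothetical.

```python
# Hypothetical totals for one team: home runs (hr) and plate appearances (pa).
home = {"hr": 90, "pa": 3000}
road = {"hr": 100, "pa": 3600}


def rate(counts: dict) -> float:
    """Occurrences per opportunity, i.e. a normalized statistic."""
    return counts["hr"] / counts["pa"]


count_ratio = home["hr"] / road["hr"]  # compares raw occurrences
rate_ratio = rate(home) / rate(road)   # compares normalized statistics

print(f"ratio of counts: {count_ratio:.3f}")
print(f"ratio of rates:  {rate_ratio:.3f}")
```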

The Denominator

The denominator in Eq. (2) can have important implications on the interpretation (and applications) of park factors.

Consider, as an example, the following two events: $K$ and $H$,

where $K$ is a strikeout and $H$ is a hit.

A common method to normalize these events would be by plate appearances ($PA$).

One criticism [1] of this, however, is that there is an indirect relationship that obscures the actual events in the park. For example, if more batters are striking out, then fewer are hitting the ball.

On one hand, if one is interested in adjusting a complete set of statistics to a particular context, then this effect is reasonable.

On the other, if one is interested in isolating the effects of a park on individual events, then one should account for this relationship. One approach [1] along these lines is to separate non-hitting from hitting events (e.g., the denominator for the latter could be balls in play).
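The effect of the denominator can be seen in a small sketch. Two hypothetical parks are given identical hit rates once strikeouts are excluded, yet different hit rates per plate appearance, purely because one park has more strikeouts. The denominator used below (plate appearances minus strikeouts) is only a stand-in assumption for balls in play; a real calculation would also exclude walks, hit batters, and so on.

```python
# Hypothetical totals per park: plate appearances, strikeouts, hits.
parks = {
    "A": {"pa": 1000, "k": 200, "h": 240},
    "B": {"pa": 1000, "k": 300, "h": 210},
}

rates = {}
for name, c in parks.items():
    non_strikeouts = c["pa"] - c["k"]  # crude proxy for balls in play
    rates[name] = {
        "k_per_pa": c["k"] / c["pa"],
        "h_per_pa": c["h"] / c["pa"],
        "h_per_non_k": c["h"] / non_strikeouts,
    }

for name, r in rates.items():
    print(name, {key: round(value, 3) for key, value in r.items()})
```

Per plate appearance, park B appears to suppress hits; with the alternative denominator, the two parks are indistinguishable for hits and differ only in strikeouts.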

Assumption of Team/Park Interchangeability

In order to calculate park factors, statistics obtained in each park relative to all others (see context, below) need to be determined.

The following assumption is well justified:

The (home) team and park are interchangeable (in consideration of statistics).

This is because approximately half of the statistics collected in any given park involve the home team.

Application to players will be considered below.

Note, however, the following potential sources of bias:

• The tendencies of (the players on) a team. If a team has a (park-neutral) tendency towards or against a particular event, this may obscure the (underlying) effect of the park.
• The quality of (the players on) a team, a related but in some sense opposite consideration. In other words, not all (categories of) players may be affected by the park in the same way.

Such bias is likely small, though, because the statistics for a team (in total, both at home and on the road) are collected over a range of opponents with varying abilities (see below), and in various parks (on the road).

Such (possible) biases may be further offset (to an extent) by considering not only the statistics of the home team (in the home park, and as the away team in the road parks), but also those of the teams being played.

• Baseball teams do not play balanced schedules.

Because of this, the abilities of the opponents faced by a team are not necessarily distributed uniformly about the league average.

This distribution (especially its uniformity and closeness to the league average) may also bias statistics (by considerations analogous to those above).

This bias should be less significant, though, under the following assumption:

The opponents faced by each team have (approximately) a uniform range of abilities, with an average close to the league average.

Note that this assumption is also important for considering the above (potential) biases small.

A related consideration is whether this distribution is the same at home as it is on the road. If it is not, then, for the reasons above, there would be an additional source of bias.

Context

Park factors may be calculated in different contexts. The two common ones are discussed below.

Road Context

Park factors are most often calculated in a road context.

In this way, statistics for (home) park $p$ are calculated directly against those for all other (road) parks $p' \neq p$.

Example: Multiplicative Park Factors

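$$PF_p = \frac{n_{p,p} / N_{p,p}}{\sum_{p' \neq p} n_{p',p} \Big/ \sum_{p' \neq p} N_{p',p}} \tag{3}$$

where the (double) subscript $p',p$ denotes the number of occurrences $n$ and opportunities $N$ in park $p'$ involving team $p$ (the home team of park $p$).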

This calculation, however, does not satisfy the above definition. [Park factors are defined relative to the league, not the road (directly).]

Another way to consider this is that the road statistics (in this context) must be corrected for the fact that the road parks' difference from the league average is offset by the park being rated.

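A reasonably approximate correction (see below) to Eq. (3) is the other parks correction (OPC) [2], a factor that multiplies the road-context park factor $PF_p$,

$$OPC_p = \frac{P}{P - 1 + PF_p}$$

where $P$ is the total number of parks considered.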

League Context

Park factors are defined in a league context.

In this way, statistics for park $p$ are calculated against those for the league.

League-Average Value

The league-average value of a statistic $S$, $S_L$, is given by

$$S_L = \sum_p w_p S_p \tag{4}$$

where $S_p$ is the statistic in park $p$ and $w_p$ is the league-weight of park $p$,

$$w_p = \frac{N_p}{N_L} \approx \frac{1}{P} \tag{5}$$

where subscript $L$ has been used to denote the league value, in this case of the total number of opportunities in the league, $N_L = \sum_p N_p$, and $P$ is the total number of parks. Note that the left-hand side, being weighted by the number of opportunities in each park, gives an average value per opportunity, whereas the right-hand side gives an average value per park.

Note that (only) the left-hand side results (precisely) in

$$S_L = \frac{n_L}{N_L};$$

that is, the “average” statistic of the league is that statistic of the league.
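The difference between the two weightings can be checked numerically. The sketch below uses three hypothetical parks with unequal numbers of opportunities; the opportunity-weighted average reproduces the league statistic exactly, while the equal-weight (per-park) average does not. The park names and counts are invented for illustration.

```python
# Hypothetical (occurrences, opportunities) for each park in a small league.
parks = {"A": (30, 1000), "B": (50, 2000), "C": (10, 500)}

n_league = sum(n for n, _ in parks.values())
N_league = sum(N for _, N in parks.values())

# Weight each park by its share of the league's opportunities.
weighted_avg = sum((N / N_league) * (n / N) for n, N in parks.values())

# Weight each park equally (an average per park).
equal_avg = sum(n / N for n, N in parks.values()) / len(parks)

league_stat = n_league / N_league
print(weighted_avg, equal_avg, league_stat)
```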

Example: Multiplicative Park Factors

(6)

where is the league-average value for (home) park , defined by

Realize in the above equation that the weight is given by the league-context [Eq. (5)].

Note that by the approximation (right-hand side) in Eq. (5), Eq. (6) is mathematically equivalent to the calculation in road context with the OPC, if

In practice, this is found to give better results than Eq. (6) directly.
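The equivalence between the two contexts can be verified with a short numerical sketch. It assumes the commonly cited form of the OPC, P / (P - 1 + PF) with P the number of parks, equal weights for every park, and a road statistic taken as the simple mean of the road-park rates; all three are assumptions made here for illustration, as are the function names and the rates.

```python
from statistics import mean


def road_context_pf(home_rate: float, road_rates: list) -> float:
    """Home rate relative to the mean rate in the road parks."""
    return home_rate / mean(road_rates)


def other_parks_correction(pf_road: float, n_parks: int) -> float:
    """Assumed OPC form: P / (P - 1 + PF)."""
    return n_parks / (n_parks - 1 + pf_road)


def league_context_pf(home_rate: float, road_rates: list) -> float:
    """Home rate relative to an equal-weight league average (home park included)."""
    n_parks = len(road_rates) + 1
    league_avg = (home_rate + sum(road_rates)) / n_parks
    return home_rate / league_avg


# Hypothetical rates for one team: at home, and in each of four road parks.
home_rate = 0.120
road_rates = [0.095, 0.100, 0.105, 0.100]
n_parks = len(road_rates) + 1

pf_road = road_context_pf(home_rate, road_rates)
pf_corrected = pf_road * other_parks_correction(pf_road, n_parks)
pf_league = league_context_pf(home_rate, road_rates)
print(pf_road, pf_corrected, pf_league)
```

The corrected road-context value and the equal-weight league-context value agree, and both are closer to 1 than the uncorrected road-context value, because the league average includes the park being rated.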

Change in Park Factors (by Context)

Park factors are calculated relative to the other parks (irrespective of context).

They therefore may change from season-to-season, even if the park itself has not.

This issue is important to consider in the context of calculation duration.

Presentation

Once park factors are calculated, consideration must be made about how to present them.

This depends on their intended purpose; and is an issue that must be understood in order to avoid confusion in their application.

Park Context

As calculated above, park factors are presented in a park context; that is, they only give (relative, see below) information about the park.

Ability Context

In order to apply park factors to adjust statistics, consideration must be given to how both are presented.

Example: Multiplicative Park Factors

In park context, the latter can only (meaningfully) be applied to the corresponding home statistics,

$$S_0 = \frac{S_p}{PF_p} \tag{7}$$

where $S_p$ is the statistic in park $p$ and $S_0$ is the park-neutral one.

As applied to only a single team (or player) (where $S_p$ would now be statistics for team $t$ calculated in park $p$), however, this disregards approximately half of all statistics.

In order to include road statistics, first note that

(unlike as suggested here). This is because is obtained by team facing only opponent (in park ). The statistics compiled therein are therefore biased by the abilities resulting from that matchup.

By the consideration (see above) that opponents with a range of abilities are faced on the road,

where are statistics collected in all parks and is an effective road park factor.

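These considerations can be used to write a total park factor for statistics involving team $t$ (with home park $p$) as

$$PF_t^{\mathrm{tot}} = \frac{N_{\mathrm{home}} \, PF_p + N_{\mathrm{road}} \, PF_{\mathrm{road}}}{N_{\mathrm{home}} + N_{\mathrm{road}}} \approx \frac{PF_p + 1}{2} \tag{8}$$

where $N_{\mathrm{home}}$ and $N_{\mathrm{road}}$ are the numbers of opportunities at home and on the road, and $PF_{\mathrm{road}}$ is the effective road park factor. The approximation includes a $1$ to account for the (approximate) average park factor of the road parks, and the $2$ in the denominator because (approximately) half of the opportunities occur in the home park and half in the road parks. Note that neither of these is precise, however.

Note that interleague “parks” (following the assumption of team/park interchangeability) can be used in Eq. (8), as long as the corresponding statistics (hence $PF_p$) were obtained from intraleague games.

Then,

$$S_{0,t} = \frac{S_t}{PF_t^{\mathrm{tot}}} \tag{9}$$

where $S_t$ are the statistics of team $t$ collected in all parks and $S_{0,t}$ is the park-neutral one.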

It is found that using all statistics and the total park factor [as opposed to Eq. (7)] also leads to significantly better results.
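A sketch of how a total park factor might be computed and applied is given below. It assumes that the total factor is an opportunity-weighted average of the home park factor and an effective road park factor, with the road factor defaulting to 1; this form, the function names, and the numbers are assumptions for illustration and are not taken from any published implementation.

```python
def total_park_factor(pf_home: float, n_home: int, n_road: int,
                      pf_road: float = 1.0) -> float:
    """Opportunity-weighted average of home and effective road park factors."""
    return (n_home * pf_home + n_road * pf_road) / (n_home + n_road)


def approx_total_park_factor(pf_home: float) -> float:
    """Common shortcut: road factor of 1 and an even home/road split."""
    return (pf_home + 1.0) / 2.0


# Hypothetical team: home park factor and opportunities at home and on the road.
pf_home = 1.20
n_home, n_road = 3100, 2900

pf_total = total_park_factor(pf_home, n_home, n_road)
pf_approx = approx_total_park_factor(pf_home)

# Hypothetical statistic compiled over all games, adjusted to park-neutral.
team_rate_all_parks = 0.033
team_rate_neutral = team_rate_all_parks / pf_total
print(pf_total, pf_approx, team_rate_neutral)
```

With an exactly even split of opportunities the two functions agree; otherwise the shortcut is only approximate, which is the sense of the caveat above.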

Normalization

For consistency with the definition, league statistics should remain unchanged when considered in an ability context.

This observation can be used for normalization.

Example: Multiplicative Park Factor

According to Eq. (1), the league park factor $PF_L$, which converts between the league statistic and the park-neutral one, should therefore be $1$.
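One way this normalization might be carried out is sketched below: the park factors are rescaled so that their opportunity-weighted average over the league is exactly 1. Treating the league park factor as this weighted average is an assumption of the sketch, and the park names and values are hypothetical.

```python
def normalize_park_factors(park_factors: dict, opportunities: dict) -> dict:
    """Rescale park factors so their opportunity-weighted league average is 1."""
    total = sum(opportunities[p] for p in park_factors)
    league_pf = sum(park_factors[p] * opportunities[p] for p in park_factors) / total
    return {p: pf / league_pf for p, pf in park_factors.items()}


# Hypothetical raw park factors and opportunities per park.
park_factors = {"A": 1.15, "B": 0.95, "C": 1.02}
opportunities = {"A": 6100, "B": 5900, "C": 6000}

normalized = normalize_park_factors(park_factors, opportunities)

# Check: the weighted average of the normalized factors.
check = sum(normalized[p] * opportunities[p] for p in normalized) / sum(opportunities.values())
print(normalized, check)
```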

Additional Considerations

There are some additional considerations that are often important.

Interleague Games

It has been suggested (e.g., see here) that interleague games should not be included in the calculation. In some of these games the teams use the designated hitter, and in others they do not. The series are also typically not home-and-home series. This leads to an imbalance in the distribution of opponents faced at home and on the road (as discussed above).

Once calculated, park factors may nevertheless be applied (in an ability context, and to a much better degree of approximation) to statistics that include interleague games.

Calculation Duration

Calculation duration is an important consideration in the contexts of noise, dynamic conditions, and the change in park factors (by context).

In any case, calculation based on season information is justified by the assumption of team/park interchangeability and the use of league data (which is perhaps best considered over this duration).

There is a tradeoff, though, as indicated above. A longer duration helps to reduce noise resulting from unpredictable dynamic conditions, including random fluctuations. However, a longer duration is also subject to a loss in accuracy, due to the changes indicated above.

In order to account for both of these issues, averages of single-season park factors over a short duration (e.g., three seasons) seem justified.
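A short-duration average of this kind might be computed as in the sketch below. The trailing three-season window, the function name, and the single-season values are assumptions made for illustration (a centered window would be an equally plausible choice).

```python
def multi_season_pf(pf_by_season: dict, season: int, window: int = 3) -> float:
    """Average single-season park factors over a trailing window of seasons."""
    seasons = [s for s in range(season - window + 1, season + 1) if s in pf_by_season]
    if not seasons:
        raise ValueError("no park factors available in the requested window")
    return sum(pf_by_season[s] for s in seasons) / len(seasons)


# Hypothetical single-season park factors for one park.
pf_by_season = {2014: 0.98, 2015: 1.10, 2016: 1.04, 2017: 1.07}

pf_2017 = multi_season_pf(pf_by_season, 2017)
print(round(pf_2017, 3))
```

Seasons missing from the input (for example, before a park opened or after it was altered) are simply skipped, so the average is taken over whatever seasons are available in the window.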

Additional corrections are sometimes applied to park factors.

An example would be a correction for predictable dynamic conditions (discussed above).

These seem best left as separate adjustments, though. Including them can obscure which part of a park factor is related to the park, and which to such corrections.

Application to Players

Park factors can be applied to players.

Player statistics should not be used to calculate park factors, however. This is because the (potential) biases in the assumption of team/park interchangeability would be more significant.

Example: Multiplicative Park Factor

In an ability context, and for multiplicative park factors, for example, the only consideration would be adjustment of the number of opportunities used to calculate the total park factor [Eq. (8)].

The Value in Park Factors

It is tempting to think that park factors have intrinsic value. Since they are defined relative to other parks, and may change (by context), however, they do not. Their information content, therefore, must be interpreted carefully and correspondingly.

Applications of Park Factors

There are several applications of park factors.

Two of the most common ones (and a derivative) are discussed below.

Ability

The most fundamental (see below) application of park factors is to determine the park-neutral ability of a team, player, etc.

There are several (more specific) applications of this; e.g.,

• determination of (park-neutral) statistics,
• and their consideration in any park context;
• (park-neutral) comparisons of team, player, etc. (statistics);
• (park-neutral) pairwise comparisons between teams, players, etc.,
• and their adjustment to any park context;

where each application essentially follows from that prior.

Value

Value quantifies the “impact” of a team, player, etc.

Park factors can be applied to this, in order to understand the context in which play took place.

The goal of a baseball team (over a game, or course of games) is to win games.

It is not possible though to quantify the importance of any particular quantity (directly) in terms of wins. It is possible, however, to do so in terms of runs. And, as discussed in a previous article, there is a relationship between these.

Park factors, in the context of value, should therefore be defined in terms of runs.

Value, from Ability

Value may also be calculated from ability.

Note that this is the basis for the aforementioned “fundamental” remark.

In this context, park-neutral ability is first determined. The number of attributable runs is then calculated.

Note that this provides the connection needed to determine value for players, for example.

Methods

This section describes application of the data mining process (detailed here) to this problem.

Data Understanding

Play-by-play data was obtained from Retrosheet.

Data Preparation

Data was processed and stored in a relational database, using the relational database management system PostgreSQL.

Data preparation used DB++ as an interface to PostgreSQL, and bbDBi as an interface to the baseball database.

Model

In order to evaluate park factors, a model is needed.

In this section, such a model is developed, in the context of evaluating ability.

Teams

Consider a league with teams labeled by $t$.

Over a season, each team plays each other team a number of times (at home or on the road).

Sample Space

The statistics considered follow from a batting-event model for baseball (expanded by one event — see below).

The sample space for this model is as follows:

$$\Omega = \{K, BB, IBB, HBP, O, 1B, 2B, 3B, HR, SH, SF, CI\}$$

where $K$ is strikeout, $BB$ is non-intentional base on balls, $IBB$ is intentional base on balls, $HBP$ is hit by pitch, $O$ is out, $1B$ is single, $2B$ is double, $3B$ is triple, $HR$ is home run, $SH$ is sacrifice hit, $SF$ is sacrifice fly, and $CI$ is catcher’s interference.

There is, in general, insufficient data to accurately calculate a (component) park factor for $CI$ (in most cases, there are zero occurrences for each team in some parks; and, in some cases, zero occurrences in total for a team).

In addition, this event is not a “pure” batting one [factoring into (or )], as it is most likely to occur on an attempted steal.

For the calculations below, the park factor for $CI$ is approximated as $1$.

Park Factors

(Component) park factors were calculated according to the considerations above.

Expected Ability

The expected ability of a team playing another in a particular park can be calculated from the method of pairwise comparisons.

For baseball, the standard method for this is the log5 method (for its use in this context, see Ref. [3]).

Note that a bias has been observed for this method (see here). It should still be accurate enough, though, to draw significant conclusions from the calculations herein.

The log5 method, in this context, may be formulated as below.

Note that the notation for the following is redefined/simplified (relative to that above) as follows.

Consider two teams, labeled by the subscripts $t$ and $p$. ($p$ is used here to denote the second team, as the context below is such that team $t$ is playing in park $p$.)

Consider again a general statistic $S$ that represents the probability of some event. Denote this statistic for team $t$, defined per plate appearance ($PA$) or batter faced by pitcher ($BFP$), by $S_t$, and for team $p$, defined per $BFP$ or $PA$, by $S_p$, respectively (in a matchup, a $PA$ for one team must be a $BFP$ for the other, and vice versa).

The expected (outcome) statistic of this event in a Bernoulli trial is then calculated as

$$S_{t,p} = \frac{S_t S_p / S_L}{S_t S_p / S_L + (1 - S_t)(1 - S_p) / (1 - S_L)} \tag{10}$$

where $S_L$ is the statistic for the league.

Under consideration of park factors (i.e., that they can influence statistics), Eq. (10) only applies in a (total) park-neutral context. This includes statistics for both teams, and that expected.

The following approach is used to account for this:

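Step 1. Calculate the park-neutral statistics (see above) for both teams $t$ and $p$, $S_{0,t}$ and $S_{0,p}$, respectively [see Eq. (9)].

Note that, by the discussion related to league context and normalization, the league statistic $S_L$ is already in such context.

Step 2. Calculate the park-neutral expected statistic $S_{0,t,p}$ by Eq. (10), using $S_{0,t}$ and $S_{0,p}$ in place of $S_t$ and $S_p$.

Step 3. Adjust the park-neutral statistic to the context of park $p$ by Eq. (1),

$$S_{t,p} = PF_p \, S_{0,t,p}$$

Notice the asymmetry in using the total park factor to adjust the statistics of each team to a park-neutral context [Eq. (9)], whereas $PF_p$ is used to adjust back to park $p$ [Eq. (1)].

Note that in Step 3, the adjusted probabilities over the sample space do not necessarily sum to $1$.

Simple normalization can be used to preserve relative probabilities.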
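Steps 2 and 3, together with the final normalization, can be sketched as follows. The snippet applies the standard binary log5 formula event by event, scales each result by a component park factor, and renormalizes so the probabilities sum to 1. Applying the binary formula per event is a simplification adopted here (the evaluation below uses a multievent formula instead), and the reduced three-event sample space, the function names, and all probabilities are hypothetical. Step 1 is assumed to have been done already.

```python
def log5(a: float, b: float, league: float) -> float:
    """Binary log5 matchup probability for two teams relative to the league."""
    favorable = a * b / league
    unfavorable = (1.0 - a) * (1.0 - b) / (1.0 - league)
    return favorable / (favorable + unfavorable)


def expected_in_park(team: dict, opponent: dict, league: dict,
                     park_factor: dict) -> dict:
    """Match up each event, scale by its park factor, then renormalize."""
    adjusted = {
        event: log5(team[event], opponent[event], league[event]) * park_factor[event]
        for event in league
    }
    total = sum(adjusted.values())
    return {event: value / total for event, value in adjusted.items()}


# Hypothetical park-neutral probabilities per plate appearance.
league = {"K": 0.20, "HR": 0.03, "other": 0.77}
team = {"K": 0.22, "HR": 0.04, "other": 0.74}         # batting, road team
opponent = {"K": 0.25, "HR": 0.025, "other": 0.725}   # pitching, home team
park_factor = {"K": 1.00, "HR": 1.20, "other": 1.00}  # component park factors

expected = expected_in_park(team, opponent, league, park_factor)
print({event: round(p, 4) for event, p in expected.items()})
```

A useful sanity check on the log5 function is that a league-average opponent leaves a team's own probability unchanged.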

Evaluation

Evaluation is based on comparing the expected to actual ability of each team playing in all road parks.

Note that even though only road games are considered, this approach implicitly considers all data, and only once. Consider the $PA$ and $BFP$ by team $t$ in park $p$. Because a $PA$ for one team must be a $BFP$ for the other (and vice versa), the home results for team $p$ (against team $t$) are therefore also considered.

This is quantified by the Brier score ,

(11)

where the sums over and accounts for each team playing in all road parks — hence, with normalization , and the sum over accounts for the sample space — with normalization of to consider both plate appearances and batters faced by pitcher for team (designated by subscripts), is the probability that was forecast {the general (multievent) matchup formula [], an extension to that discussed above} (see above (link — expected ability)) and is the actual outcome.
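The general idea of the score can be sketched in a few lines. The code below computes a generic multi-event Brier score, the mean squared difference between forecast and actual values over the events, and averages it over a set of matchups. It does not reproduce the exact normalization of Eq. (11); the averaging choices, function names, and numbers are assumptions for illustration only.

```python
def brier_score(forecast: dict, actual: dict) -> float:
    """Mean squared difference between forecast and actual values over events."""
    events = list(forecast)
    return sum((forecast[e] - actual[e]) ** 2 for e in events) / len(events)


def mean_brier_score(matchups: list) -> float:
    """Average the score over (forecast, actual) pairs, one per team/park matchup."""
    scores = [brier_score(f, a) for f, a in matchups]
    return sum(scores) / len(scores)


# Hypothetical forecast and observed event frequencies for two matchups.
matchups = [
    ({"K": 0.20, "HR": 0.03, "other": 0.77}, {"K": 0.25, "HR": 0.02, "other": 0.73}),
    ({"K": 0.22, "HR": 0.04, "other": 0.74}, {"K": 0.22, "HR": 0.04, "other": 0.74}),
]

score = mean_brier_score(matchups)
print(score)
```

Lower scores indicate forecasts closer to what was observed, with 0 for a perfect forecast.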

In order to “reliably” calculate the quantities in Eq. (11), one must consider sample size; in particular: how many (or ) are necessary to reliably calculate the probabilities. Defining “reliable” to be the point at which the signal-to-noise crosses the halfway point, it has been shown [] that (or ) are needed to converge all statistics.

Results

Results [Eq. (11)] calculated without and with adjustment of statistics by park factors for the 2010–2017 seasons of Major League Baseball are reported in the following tables:

National League:

Season   No PF     PF
2010     0.00212   0.00202
2011     0.00210   0.00197
2012     0.00222   0.00198
2013     0.00234   0.00198
2014     0.00253   0.00218
2015     0.00251   0.00227
2016     0.00235   0.00216
2017     0.00242   0.00221

American League:

Season   No PF     PF
2010     0.00220   0.00205
2011     0.00221   0.00204
2012     0.00222   0.00221
2013     0.00202   0.00191
2014     0.00219   0.00200
2015     0.00224   0.00204
2016     0.00248   0.00226
2017     0.00210   0.00198

The use of park factors leads to a noticeable improvement; on average, there is a relative decrease in Brier score of just over 8%.
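The average improvement can be recomputed from the tabulated values. The sketch below averages the per-season relative decreases over both leagues; reading "on average" as a simple mean over the 16 league-seasons is an assumption, though pooling the scores instead gives a similar figure.

```python
# Brier scores from the tables above: season -> (without PF, with PF).
brier = {
    "NL": {
        2010: (0.00212, 0.00202), 2011: (0.00210, 0.00197),
        2012: (0.00222, 0.00198), 2013: (0.00234, 0.00198),
        2014: (0.00253, 0.00218), 2015: (0.00251, 0.00227),
        2016: (0.00235, 0.00216), 2017: (0.00242, 0.00221),
    },
    "AL": {
        2010: (0.00220, 0.00205), 2011: (0.00221, 0.00204),
        2012: (0.00222, 0.00221), 2013: (0.00202, 0.00191),
        2014: (0.00219, 0.00200), 2015: (0.00224, 0.00204),
        2016: (0.00248, 0.00226), 2017: (0.00210, 0.00198),
    },
}

# Relative decrease in Brier score for each league-season.
relative_decreases = [
    (no_pf - pf) / no_pf
    for seasons in brier.values()
    for no_pf, pf in seasons.values()
]

average_decrease = sum(relative_decreases) / len(relative_decreases)
print(f"average relative decrease: {average_decrease:.1%}")
```

Every league-season shows a decrease, although its size varies considerably, from well under 1% to roughly 15%.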

Discussion and Conclusions

Park factors were considered in detail.

While a general definition was provided, their specific formulation and presentation require several considerations.

Significant attention (in particular, by way of examples) was given to the type of park factor most often encountered: multiplicative park factors.

In the interest of transparency, it is recommended that all park factors developed and published explicitly state

• the considerations that enter into their formulation,
• how they are to be applied.

A model for the evaluation of park factors was developed, in order to assess their accuracy.

This was applied to the 2010–2017 seasons for MLB. It was shown that they do filter out some of the (park) bias in reported statistics; in other words, they are indeed useful for determining park-neutral ability.

Published Park Factors

Some published park factors are considered below.

Note that this is not a comprehensive list. Rather, brief comments are made about those listed, in order to highlight some of the issues discussed above.

ESPN: ESPN publishes park factors following the approach leading to Eq. (3), without the OPC.

These are presented in a park context.

Baseball-Reference.com: Baseball-Reference.com publishes park factors following the approach leading to Eq. (3), and including the OPC.

Additional corrections are included to account for the fact that players (on a team) do not have to face their own teammates.

A discussion of their calculation of park factors is given here.

There are several additional websites that report park factors, which can be revealed by a simple Google search.

References

[1] J. Furtado, “Park Effects,” Baseball Think Factory [online] (1997).

[2] The OPC was originally published by the (now defunct) TotalBaseball.com website (archived copy here).

[3] B. James, “Log5 Method,” The Bill James Baseball Abstract, pp. 12–13 (1983).

[4] M. Haechrel, “Matchup Probabilities in Major League Baseball,” The Baseball Research Journal 43 (2014).

[5] R. A. Carleton, “Baseball Therapy: It’s a Small Sample Size After All,” Baseball Prospectus [online] (2012).
