Baseball parks can have a significant effect on the events that occur in them.
Park factors can be used to quantify these effects, in order to understand the context of data that have been collected.
There is, however, considerable confusion and disagreement about their definition and about how they should be calculated and applied.
In this Article, park factors are considered in detail.
Introduction
Since 1998, there have been 30 Major League Baseball (MLB) teams (see here).
Each of them plays in a different baseball park (herein referred to simply as a “park”) (at least, to different extents).
Parks can have a significant effect on the events that can occur in them. Common examples include the significantly increased carry of fly balls in Coors Field (which, for example, increases home runs), or the increase in prevalence of doubles in Fenway Park.
In order to (meaningfully) consider data for a team or player (for example, to calculate value or ability), it is important to understand the context (in this case, the park) in which the data were collected.
Park factors can be used to quantify these effects.
There is, however, considerable confusion and disagreement about their definition and about how they should be calculated and applied.
Careful consideration reveals that many of these issues are (in some sense) related, and arise only from a simple misunderstanding of their aspects.
The disagreement follows from the (approach to their) calculation, which differs significantly between sources, while the confusion follows from their misinterpretation.
In this Article, park factors are considered in detail.
This Article is outlined as follows. Park factors are first defined. General considerations of park factors are then made. Following discussion of a necessary assumption, further considerations are made in the context of calculation. Additional considerations conclude (this part). The methods used for calculation are then described and results are presented, which show the merit of park factors. A discussion and conclusions follow.
Definition
In order to consider park factors in detail, it is necessary to first provide a fundamental (theoretically well-defined) definition, as follows:
The park factor $PF_p$ for park $p$ is that which converts between a quantity $Q_p$ calculated in the park $p$ and the park-neutral ($PN$) one $Q_{PN}$.
Note that “park-neutral” should be considered synonymous with “league-average”.
There are several ways in which this conversion can take place.
Example: Multiplicative Park Factors
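The most common method as applied to statistics is multiplicative; that is,
$$Q_p = PF_p \, Q_{PN} \tag{1}$$
where $Q_p$ is the quantity calculated in park $p$, $Q_{PN}$ is the park-neutral one, and $PF_p$ is the park factor.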
Note that, despite their common application, careful thought suggests that this is not theoretically correct.
Alternative approach(es) (and further consideration of multiplicative park factors) will be discussed in future article(s).
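As a minimal sketch of the multiplicative convention, assuming the park factor multiplies the park-neutral quantity to give the in-park one (the function names and the numbers are illustrative only):

```python
def to_park_neutral(stat_in_park: float, park_factor: float) -> float:
    """Remove a multiplicative park effect from a statistic measured in a park."""
    return stat_in_park / park_factor


def to_park_context(stat_neutral: float, park_factor: float) -> float:
    """Apply a multiplicative park effect to a park-neutral statistic."""
    return stat_neutral * park_factor


# Hypothetical numbers: a home-run rate of 0.040 per plate appearance
# observed in a park whose home-run park factor is 1.25.
neutral_rate = to_park_neutral(0.040, 1.25)
print(round(neutral_rate, 4))
```

The two functions are inverses of each other, which is all that the multiplicative form asserts; whether that form is theoretically appropriate is a separate question.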
Considerations
There are several considerations that must be made. Below these are made in the context of park factors, in general.
Static vs. Dynamic Conditions
Park factors are affected by conditions that can be categorized into static or dynamic.
Static Conditions
Static conditions are those that do not change from season-to-season. An example includes the dimensions of the park (assuming that no changes have been made).
Dynamic Conditions
Dynamic conditions are those that possibly do change from season-to-season.
These can be further categorized into predictable and unpredictable.
Predictable conditions are those that may change, but with an effect that may be predicted. An example of this includes the cut of the grass, given the same groundskeeper.
Depending on the source of these conditions, these are perhaps better considered as an additional correction.
Unpredictable-dynamic conditions are not possible to consider as an additional correction. This may include effects too complex to be predictable, (essentially) random, etc.
Some conditions have characteristics that (practically) fall into both categories (and to varying extents). One particularly significant example is the weather.
Dynamic conditions are important to consider in the context of calculation duration.
The “Quantity”
Several (if not all) quantities are affected by the park (at least, to some extent).
Often of interest are statistics. Focus below will be on this.
Run park factors are most commonly used.
Parks do not affect all events to the same extent, however. In order to understand this context, and determine a “complete” park-neutral one, the effects on each event should be considered individually, giving component park factors.
In the following, a general statistic will be denoted by $S$.
Normalization of Statistics
An important consideration for statistics is “normalization”.
That is,
$$S = \frac{n}{N} \tag{2}$$
where $n$ is the number of occurrences and $N$ is the (normalization) total number of opportunities.
(Using $n$ directly would introduce a bias.)
The Denominator
The denominator in Eq. (2) can have important implications on the interpretation (and applications) of park factors.
Consider, as an example, the following two events:
$$K, \quad H$$
where $K$ is a strikeout and $H$ is a hit.
A common method to normalize these events would be by plate appearances ($PA$).
One criticism [] of this, however, is that there is an indirect relationship that obscures the actual events of the park. For example, if more batters are striking out, then fewer are hitting the ball.
On one hand, if one is interested in adjusting a complete set of statistics to a particular context, then this effect is reasonable.
On the other, if one is interested in isolating the effects of a park on individual events, then one should account for this relationship. One approach [] along these lines is to separate non-hitting from hitting events (e.g., the denominator for the latter could be balls in play).
Assumption of Team/Park Interchangeability
In order to calculate park factors, statistics obtained in each park relative to all others (see context, below) need to be determined.
The following assumption is well justified:
The (home) team and park are interchangeable (in consideration of statistics).
This is because approximately half of the statistics collected in any given park involve the home team.
Application to players will be considered below.
Realize though the following sources of potential bias:
- The tendencies of (the players on) a team. If a team has a (park-neutral) tendency towards or against a particular event, this may obscure the (underlying) effect of the park.
- (The related, but in some sense opposite consideration of) the quality of (the players on) a team. In other words, not all (categories of) players may be affected by park factors in the same way.
Such bias is likely small, though, because the statistics for a team (in total, both at home and on the road) are collected over a range of opponents, with varying abilities (see below), and in various parks (on the road).
Such (possible) biases may be further offset (to an extent) by considering not only the statistics of the home team (in the home park, and as the away team in the road parks), but also those of the teams being played.
This leads to the following additional consideration, however:
- Baseball teams do not play balanced schedules.
Because of this, the abilities of the opponents faced by a team are not necessarily distributed uniformly about the league average.
This distribution (especially, its uniformity, and closeness to league average) may also bias statistics (by way of analogous considerations as above).
This bias should be less significant though; under the following assumption:
The opponents faced by each team have (approximately) a uniform range of abilities, with an average close to the league average.
Note that this assumption is also important for considering the above (potential) biases small.
A related consideration is that this distribution is the same at home as it is on the road. Otherwise, for the reasons above, there would be an additional source of bias.
Context
Park factors may be calculated in different contexts. The two common ones are discussed below.
Road Context
Park factors are most often calculated in a road context.
In this way, statistics for (home) park $p$ are calculated directly against those for all other (road) parks $r \neq p$.
Example: Multiplicative Park Factors
For multiplicative park factors,
$$PF_p^{road} = \frac{n_{p,p} / N_{p,p}}{\sum_{r \neq p} n_{r,p} \Big/ \sum_{r \neq p} N_{r,p}} \tag{3}$$
where the (double) subscript denotes the number of occurrences $n_{r,t}$ and opportunities $N_{r,t}$ in park $r$ involving team $t$ (the home team of park $p$ is also labeled $p$).
This calculation, however, does not satisfy the above definition. [Park factors are defined relative to the league, not (directly) the road.]
An analogous way to consider this is that the road statistics (in this context) must be corrected for the fact that the road parks' difference from the league average is offset by the park that is being rated.
A reasonably approximate correction (see below) to Eq. (3) is the other parks correction (OPC) []
$$PF_p \approx PF_p^{road} \cdot \frac{T}{T - 1 + PF_p^{road}}$$
where $T$ is the total number of parks considered.
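The road-context calculation and the OPC can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: the road rate is taken as total road occurrences over total road opportunities, the OPC is assumed to have the commonly quoted form `T / (T - 1 + PF)`, and all names and counts are hypothetical.

```python
def road_park_factor(home_n: int, home_opp: int, road_n: int, road_opp: int) -> float:
    """Raw road-context park factor: home rate divided by road rate.

    Counts include both the home team and its opponents in each setting.
    """
    home_rate = home_n / home_opp
    road_rate = road_n / road_opp
    return home_rate / road_rate


def other_parks_correction(raw_pf: float, num_parks: int) -> float:
    """Assumed OPC form: T / (T - 1 + raw_pf), with T the number of parks."""
    return num_parks / (num_parks - 1 + raw_pf)


# Hypothetical counts: 100 events in 3000 home opportunities,
# 80 events in 3000 road opportunities, in a 15-park league.
raw = road_park_factor(100, 3000, 80, 3000)
scaled = raw * other_parks_correction(raw, 15)
print(raw, scaled)
```

The correction pulls the raw factor toward 1, and it leaves a neutral park (raw factor of exactly 1) unchanged.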
League Context
Park factors are defined in a league context.
In this way, statistics for park $p$ are calculated against those for the league.
League-Average Value
The league-average value of a statistic $S$, $S_L$, is given by
$$S_L = \sum_p w_p S_p \tag{4}$$
where $w_p$ is the league-weight of park $p$,
$$w_p = \frac{N_p}{N_L} \approx \frac{1}{T} \tag{5}$$
where the subscript $L$ has been used to denote the league value, in this case of the total number of opportunities by the league, $N_L$, and $T$ is the total number of parks. Note that the left-hand side, being weighted by the number of opportunities in each park, gives an average value per opportunity; whereas the right-hand side is an average value per park.
Note that (only) the left-hand side results (precisely) in
$$S_L = \frac{n_L}{N_L}$$
that is, the “average” statistic of the league is that statistic of the league.
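The difference between the two weightings can be checked numerically. The sketch below uses made-up counts for a three-park league and compares the opportunity-weighted average with the simple per-park average.

```python
# Hypothetical occurrences and opportunities in each of three parks.
occurrences = [30, 45, 25]
opportunities = [1000, 1200, 800]

rates = [n / N for n, N in zip(occurrences, opportunities)]
total_opportunities = sum(opportunities)

# Opportunity weights: each park's share of league opportunities.
weights = [N / total_opportunities for N in opportunities]
weighted_average = sum(w * s for w, s in zip(weights, rates))

# Equal weights: a simple average over parks.
per_park_average = sum(rates) / len(rates)

# The statistic computed directly from league totals.
league_rate = sum(occurrences) / total_opportunities

print(weighted_average, per_park_average, league_rate)
```

Only the opportunity-weighted average reproduces the rate computed from league totals; the per-park average differs whenever the parks have unequal numbers of opportunities.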
Example: Multiplicative Park Factors
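For multiplicative park factors,
$$PF_p = \frac{S_{p,p}}{S_{L,p}} \tag{6}$$
where $S_{L,p}$ is the league-average value for (home) park $p$, defined by
$$S_{L,p} = \sum_r w_r S_{r,p}$$
with $S_{r,p}$ the statistic in park $r$ involving team $p$.
Realize in the above equation that the weight $w_r$ is given by the league context [Eq. (5)].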
Note that by the approximation (right-hand side) in Eq. (5), Eq. (6) is mathematically equivalent to the calculation in road context with the OPC, if
In practice, this is found to give better results than Eq. (6) directly.
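The equivalence can be illustrated numerically under three explicit assumptions, none of which should be read as the author's exact formulation: every park gets the same weight, the road rate is the simple mean of the rates in the other parks, and the OPC has the form `T / (T - 1 + PF)`. The rates below are invented.

```python
def league_context_pf_equal_weights(rates_by_park: list[float], home: int) -> float:
    """Home rate divided by the equally weighted average over all parks."""
    league_average = sum(rates_by_park) / len(rates_by_park)
    return rates_by_park[home] / league_average


def road_context_pf_with_opc(rates_by_park: list[float], home: int) -> float:
    """Home rate over mean road rate, scaled by the assumed OPC form."""
    num_parks = len(rates_by_park)
    road_rates = [r for i, r in enumerate(rates_by_park) if i != home]
    raw = rates_by_park[home] / (sum(road_rates) / (num_parks - 1))
    return raw * num_parks / (num_parks - 1 + raw)


# Hypothetical rates for one team's games in each of five parks (index 0 is home).
rates = [0.040, 0.031, 0.033, 0.029, 0.035]
print(league_context_pf_equal_weights(rates, 0), road_context_pf_with_opc(rates, 0))
```

Under these assumptions the two results agree to rounding error, because the league average is just the home rate and the road mean recombined with weights `1/T` and `(T - 1)/T`.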
Change in Park Factors (by Context)
Park factors are calculated relative to the other parks (irrespective of context).
They therefore may change from season-to-season, even if the park itself has not.
This issue is important to consider in the context of calculation duration.
Presentation
Once park factors are calculated, consideration must be made about how to present them.
This depends on their intended purpose; and is an issue that must be understood in order to avoid confusion in their application.
Park Context
As calculated above, the park factors are presented in park context; that is, they only give (relative, see below) information about the park.
Ability Context
In order to apply park factors to adjust statistics, consideration has to be made to how both are presented.
Example: Multiplicative Park Factors
In park context, the latter can only (meaningfully) be applied to the corresponding home statistics,
$$S_{PN} = \frac{S_p}{PF_p} \tag{7}$$
As applied to only a single team (or player) (where $S_p$ would now be statistics $S_{p,t}$ for team $t$ calculated in park $p$), however, this disregards approximately half of all statistics.
In order to include road statistics, first note that
$$S_{PN,t} \neq \frac{S_{r,t}}{PF_r}$$
(unlike as suggested here). This is because $S_{r,t}$ is obtained by team $t$ facing only opponent $r$ (in park $r$). The statistics compiled therein are therefore biased by the abilities resulting from that matchup.
By the consideration (see above) that opponents with a range of abilities are faced on the road,
$$S_{PN,t} \approx \frac{S_{road,t}}{PF_{road}}$$
where $S_{road,t}$ are statistics collected in all parks $r \neq p$ and $PF_{road}$ is an effective road park factor.
These considerations can be used to write a total park factor for statistics involving team $t$ as
$$PF_t^{tot} = \frac{N_{p,t} \, PF_p + N_{road,t} \, PF_{road}}{N_{p,t} + N_{road,t}} \approx \frac{PF_p + 1}{2} \tag{8}$$
where $N_{p,t}$ and $N_{road,t}$ are the numbers of opportunities for team $t$ at home and on the road. The approximation includes a $1$ to account for an (approximate) average park factor of the road parks, and the $2$ in the denominator because (approximately) half of the opportunities occur at each of the home and away parks. Note that neither of these is precise, however.
Note that interleague “parks” (following the assumption of team/park interchangeability) can be used in Eq. (8), as long as the corresponding statistics (hence $PF$) were obtained from intraleague games.
Then,
$$S_{PN,t} = \frac{S_t}{PF_t^{tot}} \tag{9}$$
where $S_t$ is the statistic for team $t$ over all parks.
It is found that using all statistics and the total park factor [as opposed to Eq. (7)] also leads to significantly better results.
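A total park factor of this kind can be sketched as an opportunity-weighted blend of the home park factor and an effective road park factor. The weighting, the function names, and the numbers below are assumptions made for illustration; the half-and-half shortcut is the familiar `(PF + 1) / 2` approximation.

```python
def total_park_factor(home_pf: float, home_opp: int,
                      road_pf: float, road_opp: int) -> float:
    """Blend home and effective road park factors by number of opportunities."""
    return (home_opp * home_pf + road_opp * road_pf) / (home_opp + road_opp)


def total_park_factor_approx(home_pf: float) -> float:
    """Shortcut: road parks average to 1 and half of all opportunities are at home."""
    return (home_pf + 1.0) / 2.0


# Hypothetical team: home park factor 1.10 over 3000 opportunities,
# effective road park factor 0.99 over 3100 opportunities.
exact = total_park_factor(1.10, 3000, 0.99, 3100)
approx = total_park_factor_approx(1.10)

# Park-neutral version of a season rate of 0.035 (all parks combined).
print(exact, approx, 0.035 / exact)
```

The shortcut coincides with the weighted form only when the road factor is exactly 1 and the opportunities split evenly, which is why neither part of the approximation is precise.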
Normalization
For consistency with the definition, league statistics should remain unchanged when considered in an ability context.
This observation can be used for normalization.
Example: Multiplicative Park Factor
According to Eq. (1), the league park factor, $PF_L$, should therefore be $1$.
Additional Considerations
There are some additional considerations that are often important.
Interleague Games
It is suggested (e.g., see here) that interleague games should not be included in the calculation. In some games, the teams have the designated hitter, and in others, they don’t. The series are also typically not home-and-home series. This leads to an imbalance in the consideration of the distribution over opponents considered at home and on the road (as discussed above).
Following calculation, park factors may be applied in an ability context (to a much better degree of approximation) to statistics that include interleague games.
Calculation Duration
Calculation duration is an important consideration in the contexts of noise, dynamic conditions, and the change in park factors (by context).
In any case, calculation based on season information is justified by the assumption of team/park interchangeability and the use of league data (which is perhaps best considered over this duration).
There is a tradeoff though, as indicated above. A longer duration helps to reduce noise resulting from unpredictable dynamic conditions, including random fluctuations. However, longer durations are subject to a loss in accuracy, due to the changes indicated above.
In order to account for both of these issues, averages of single-season park factors over a short duration (e.g., three seasons) seem justified.
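A short multi-season average might be sketched as below. The trailing three-season window, the simple unweighted mean, and the sample values are all assumptions; a centered or weighted window would be an equally reasonable choice.

```python
from statistics import mean


def multi_season_park_factor(single_season_pfs: dict[int, float],
                             season: int, window: int = 3) -> float:
    """Average single-season park factors over a trailing window of seasons.

    Seasons missing from the input (e.g., a new park) are skipped.
    """
    seasons = [season - k for k in range(window)]
    values = [single_season_pfs[s] for s in seasons if s in single_season_pfs]
    if not values:
        raise ValueError("no park factors available in the requested window")
    return mean(values)


# Hypothetical single-season run park factors for one park.
pfs = {2015: 1.10, 2016: 1.04, 2017: 1.07}
print(multi_season_park_factor(pfs, 2017))
```

Averaging already-computed single-season factors, rather than pooling raw counts across seasons, keeps each season's factor relative to that season's league.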
Additional Corrections
Additional corrections are sometimes applied to park factors.
An example would be predictable-dynamic conditions (discussed above).
These seem best left as separate adjustments though. Including them can obscure which parts of park factors are related to the park, and which to such corrections.
Application to Players
Park factors can be applied to players.
Player statistics should not be used to calculate park factors, however. This is because the (potential) biases in the assumption of team/park interchangeability would be more significant.
Example: Multiplicative Park Factor
In an ability context, and for multiplicative park factors, for example, the only consideration would be adjustment of the number of opportunities used to calculate the total park factor [Eq. (8)].
The Value in Park Factors
It is tempting to think that park factors have intrinsic value. Since they are defined relative to other parks, and may change (by context), however, they do not. Their information content, therefore, must be interpreted carefully and correspondingly.
Applications of Park Factors
There are several applications of park factors.
Two of the most common ones (and a derivative) are discussed below.
Ability
The most fundamental (see below) application of park factors is to determine the park-neutral ability of a team, player, etc.
There are several (more specific) applications of this; e.g.,
- determination of (park-neutral) statistics,
and their consideration in any park context;
- (park-neutral) comparisons of team, player, etc. (statistics);
- (park-neutral) pairwise comparisons between teams, players, etc.,
and their adjustment to any park context;
where each application essentially follows from that prior.
Value
Value quantifies the “impact” of a team, player, etc.
Park factors can be applied to this, in order to understand the context in which games have been played.
The goal of a baseball team (over a game, or course of games) is to win games.
It is not possible though to quantify the importance of any particular quantity (directly) in terms of wins. It is possible, however, to do so in terms of runs. And, as discussed in a previous article, there is a relationship between these.
Park factors, in the context of value, should therefore be defined in terms of runs.
Value, from Ability
Value may also be calculated from ability.
Note that this is the basis for the aforementioned “fundamental” remark.
In this context, park-neutral ability is first determined. The number of attributable runs is then calculated.
Note that this provides the connection needed to determine value for players, for example.
Methods
This section describes application of the data mining process (detailed here) to this problem.
Data Understanding
Play-by-play data was obtained from Retrosheet.
Data Preparation
Data was processed and stored in a relational database, using the relational database management system PostgreSQL.
Data preparation used DB++ as an interface to PostgreSQL, and bbDBi as an interface to the baseball database.
Model
In order to evaluate park factors, a model is needed.
In this section, such a model is developed in the context of evaluating ability.
Teams
Consider a league with teams labeled by $t$.
Over a season, each team plays every other team a number of times (at home or on the road).
Sample Space
The statistics considered follow from a batting-event model for baseball (expanded by one event — see below).
The sample space for this model is as follows:
$$\Omega = \{K, NIBB, IBB, HBP, O, 1B, 2B, 3B, HR, SH, SF, CI\}$$
where $K$ is strikeout, $NIBB$ is non-intentional base on balls, $IBB$ is intentional base on balls, $HBP$ is hit by pitch, $O$ is out, $1B$ is single, $2B$ is double, $3B$ is triple, $HR$ is home run, $SH$ is sacrifice hit, $SF$ is sacrifice fly, and $CI$ is catcher’s interference.
There is, in general, insufficient data to accurately calculate a (component) park factor for catcher’s interference ($CI$) (in most cases, there are zero occurrences for each team in some parks; and, in some cases, zero occurrences total for a team).
In addition, this event is not a “pure” batting one [factoring into (or )], as it is most likely to occur on an attempted steal.
For the calculations below, the park factor for $CI$ is approximated as $1$.
Park Factors
(Component) park factors were calculated according to the considerations above.
Expected Ability
The expected ability of a team playing another in a particular park can be calculated from the method of pairwise comparisons.
For baseball, the standard method for this is the log5 method (for its use in this context, see Ref. []).
Note that a notable bias has been observed for this method (see here). It should still be accurate enough though to draw significant conclusions for the calculations herein.
The log5 method, in this context, may be formulated as below.
Note that notation for the following is redefined/simplified (relative to above) as follows.
Consider two teams, labeled by the subscripts $t$ and $p$. ($p$ is used here to denote the second team, as the context below is such that team $t$ is playing in park $p$.)
Consider again a general statistic $S$ that represents the probability of some event. Denote this statistic for team $t$, defined per plate appearance ($PA$) or batter faced by pitcher ($BFP$), by $S_t$, and for team $p$, defined per $BFP$ or $PA$, by $S_p$, respectively (in a matchup, a $PA$ for one team must be a $BFP$ by the other, and vice versa).
The expected (outcome) statistic of this event of a Bernoulli trial, $E_{t,p}$, is then calculated as
$$E_{t,p} = \frac{S_t S_p / S_L}{S_t S_p / S_L + (1 - S_t)(1 - S_p) / (1 - S_L)} \tag{10}$$
where $S_L$ is the statistic for the league.
Under consideration of park factors (i.e., that they can influence statistics), Eq. (10) only applies in a (total) park-neutral context. This includes statistics for both teams, and that expected.
The following approach is used to account for this:
Step 1. Calculate the park-neutral statistics (see above) for both teams $t$ and $p$, $S_{PN,t}$ and $S_{PN,p}$, respectively [see Eq. (9)].
Note that, by the discussion related to league context and normalization, the league statistic $S_L$ is already in such context.
Step 2. Calculate the park-neutral expected statistic $E_{PN}$ from Eq. (10), using $S_{PN,t}$, $S_{PN,p}$, and $S_L$.
Step 3. Adjust the park-neutral statistic to the context of park $p$,
$$E_{t,p} = PF_p \, E_{PN}$$
Notice the asymmetry in using the total park factor $PF^{tot}$ to adjust $S$ for each team to a park-neutral context [Eq. (9)], whereas $PF_p$ is used to adjust back to park $p$ [Eq. (1)].
Note that in Step 3, the adjusted probabilities over the sample space do not, in general, sum to one.
Simple normalization can be used to preserve relative probabilities.
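The three steps can be sketched as follows. This is a simplified illustration, not the article's calculation: it applies the standard two-outcome log5 formula (with a league term) to each event separately and then renormalizes, whereas a multievent matchup formula would treat all events jointly. All names, rates, and park factors are hypothetical.

```python
def log5(rate_a: float, rate_b: float, league_rate: float) -> float:
    """Two-outcome log5 matchup probability with a league-average term."""
    num = rate_a * rate_b / league_rate
    den = num + (1 - rate_a) * (1 - rate_b) / (1 - league_rate)
    return num / den


def expected_in_park(bat, pit, league, bat_total_pf, pit_total_pf, park_pf):
    """Expected event probabilities for a matchup played in a given park.

    Each argument is a dict keyed by event name.
    """
    # Step 1: park-neutral rates for both teams (divide by total park factor).
    # Step 2: park-neutral expected rate per event from log5.
    neutral = {
        e: log5(bat[e] / bat_total_pf[e], pit[e] / pit_total_pf[e], league[e])
        for e in league
    }
    # Step 3: move to the park's context, then renormalize to sum to one.
    adjusted = {e: neutral[e] * park_pf[e] for e in neutral}
    total = sum(adjusted.values())
    return {e: v / total for e, v in adjusted.items()}


league = {"K": 0.20, "H": 0.25, "O": 0.55}
ones = {e: 1.0 for e in league}
hitter_park = {"K": 1.0, "H": 1.2, "O": 1.0}
print(expected_in_park(league, league, league, ones, ones, hitter_park))
```

A quick sanity check on the formula is that a league-average opponent leaves a team's own rate unchanged.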
Evaluation
Evaluation is based on comparing the expected to actual ability of each team playing in all road parks.
Note that even though only road games are considered, this approach implicitly considers all data, and only once. Consider the plate appearances ($PA$) and batters faced by pitcher ($BFP$) by team $t$ in park $p$. Because a $PA$ for one team must be a $BFP$ for the other (and vice versa), the home results for team $p$ (against team $t$) are therefore also considered.
This is quantified by the Brier score $BS$,
$$BS = \frac{1}{T(T-1)} \sum_{t} \sum_{p \neq t} \frac{1}{2} \sum_{x \in \{PA, BFP\}} \sum_{e} \left( f_{e,x} - o_{e,x} \right)^2 \tag{11}$$
where the sums over $t$ and $p$ account for each team $t$ playing in all road parks $p$ (hence the normalization $T(T-1)$, with $T$ the number of teams), the sum over $e$ accounts for the sample space, the normalization of $2$ considers both plate appearances ($PA$) and batters faced by pitcher ($BFP$) for team $t$ (designated by subscripts), $f$ is the probability that was forecast {the general (multievent) matchup formula [], an extension of that discussed above under Expected Ability}, and $o$ is the actual outcome.
In order to “reliably” calculate the quantities in Eq. (11), one must consider sample size; in particular: how many (or
) are necessary to reliably calculate the probabilities. Defining “reliable” to be the point at which the signal-to-noise crosses the halfway point, it has been shown [
] that
(or
) are needed to converge all statistics.
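A Brier-type score of this kind can be sketched as the average, over matchups, of the summed squared differences between forecast and observed event frequencies. The flat list of matchups and the names below are simplifying assumptions; the bookkeeping over teams, road parks, and the batting and pitching sides is left out.

```python
def brier_score(forecasts: list[dict[str, float]],
                outcomes: list[dict[str, float]]) -> float:
    """Mean over matchups of the squared forecast error summed over events."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("forecasts and outcomes must be non-empty and aligned")
    total = 0.0
    for forecast, outcome in zip(forecasts, outcomes):
        total += sum((forecast[e] - outcome[e]) ** 2 for e in forecast)
    return total / len(forecasts)


# Hypothetical forecast and observed frequencies for two matchups.
forecasts = [{"K": 0.20, "H": 0.25, "O": 0.55}, {"K": 0.22, "H": 0.24, "O": 0.54}]
observed = [{"K": 0.21, "H": 0.23, "O": 0.56}, {"K": 0.22, "H": 0.24, "O": 0.54}]
print(brier_score(forecasts, observed))
```

Lower is better: a forecast that matches the observed frequencies exactly scores zero.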
Results
Results [Eq. (11)] calculated without and with adjustment of statistics by park factors for the 2010–2017 seasons of Major League Baseball are reported in the following tables:
National League:
Season | No PF | PF |
---|---|---|
2010 | 0.00212 | 0.00202 |
2011 | 0.00210 | 0.00197 |
2012 | 0.00222 | 0.00198 |
2013 | 0.00234 | 0.00198 |
2014 | 0.00253 | 0.00218 |
2015 | 0.00251 | 0.00227 |
2016 | 0.00235 | 0.00216 |
2017 | 0.00242 | 0.00221 |
American League:
Season | No PF | PF |
---|---|---|
2010 | 0.00220 | 0.00205 |
2011 | 0.00221 | 0.00204 |
2012 | 0.00222 | 0.00221 |
2013 | 0.00202 | 0.00191 |
2014 | 0.00219 | 0.00200 |
2015 | 0.00224 | 0.00204 |
2016 | 0.00248 | 0.00226 |
2017 | 0.00210 | 0.00198 |
The use of park factors leads to a noticeable improvement; on average, there is a relative decrease in Brier score of just over 8%.
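The average improvement can be recomputed directly from the two tables above. The sketch below takes the relative decrease for each league-season and averages the sixteen values; averaging per season, rather than pooling the scores first, is an assumption about how the summary figure is defined.

```python
# (No PF, PF) Brier scores by season, 2010 through 2017, from the tables above.
national_league = [
    (0.00212, 0.00202), (0.00210, 0.00197), (0.00222, 0.00198), (0.00234, 0.00198),
    (0.00253, 0.00218), (0.00251, 0.00227), (0.00235, 0.00216), (0.00242, 0.00221),
]
american_league = [
    (0.00220, 0.00205), (0.00221, 0.00204), (0.00222, 0.00221), (0.00202, 0.00191),
    (0.00219, 0.00200), (0.00224, 0.00204), (0.00248, 0.00226), (0.00210, 0.00198),
]

rows = national_league + american_league
relative_decreases = [(no_pf - pf) / no_pf for no_pf, pf in rows]
avg_relative_decrease_pct = 100 * sum(relative_decreases) / len(relative_decreases)
print(round(avg_relative_decrease_pct, 1))
```

Every league-season shows a decrease, although its size varies considerably from one season to the next.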
Discussion and Conclusions
Park factors were considered in detail.
While a general definition was provided, their specific formulation and presentation require several considerations.
Significant attention (in particular, by way of examples) was given to the type of park factor often encountered, multiplicative park factors.
In the interest of transparency, it is recommended that all park factors developed and published explicitly state
- the considerations that enter into their formulation,
- how they are to be applied.
A model for the evaluation of park factors was developed, in order to assess their accuracy.
This was applied to the 2010–2017 seasons for MLB. It was shown that they do filter out some of the (park) bias in reported statistics; in other words, they are indeed useful for determining park-neutral ability.
Published Park Factors
Some published park factors are considered below.
Note that these do not compose a comprehensive list. Rather, brief comments may be made about those listed, in order to highlight some of the issues discussed above.
ESPN: ESPN publishes park factors following the approach leading to Eq. (3), without the OPC.
These are presented in a park context.
Baseball-Reference.com: Baseball-Reference.com publishes park factors, following the approach leading to Eq. (3), and including the OPC.
Additional corrections are included to account for the fact that players (on a team) do not have to face their own teammates.
A discussion of their calculation of park factors is given here.
FanGraphs: FanGraphs publishes (component) park factors (in addition to traditional ones).
There are several additional websites that report park factors, which can be revealed by a simple Google search.
References
[] J. Furtado, “Park Effects,” Baseball Think Factory [online] (1997).
[] The OPC was originally published by the (now defunct) TotalBaseball.com website (archived copy here).
[] B. James, “Log5 Method,” The Bill James Baseball Abstract, pp. 12–13 (1983).
[] M. Haechrel, “Matchup Probabilities in Major League Baseball,” The Baseball Research Journal 43 (2014).
[] R. A. Carleton, “Baseball Therapy: It’s a Small Sample Size After All,” Baseball Prospectus [online] (2012).