The Pythagorean Won-Loss Formula and Hockey: A Statistical Justification for Using the Classic Baseball Formula as an Evaluative Tool in Hockey
Originally devised for baseball, the Pythagorean Won-Loss formula estimates
the percentage of games a team should have won at a particular point in a
season. For decades, this formula had no mathematical justification. In 2006,
Steven Miller provided a statistical derivation by making some heuristic
assumptions about the distributions of runs scored and allowed by baseball
teams. We make a similar set of assumptions about hockey teams and show that
the formula is just as applicable to hockey as it is to baseball. We hope that
this work spurs research in the use of the Pythagorean Won-Loss formula as an
evaluative tool for sports outside baseball.
Comment: 21 pages, 4 figures; Forthcoming in The Hockey Research Journal: A
Publication of the Society for International Hockey Research, 2012/1
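As a rough illustration of the formula this abstract discusses, the sketch below computes a Pythagorean expected winning percentage from totals scored and allowed (goals in hockey, runs in baseball). The function name `pythagorean_win_pct` and the default exponent of 2 (the classic Bill James value) are assumptions made here for illustration; the abstract does not state which exponent is appropriate for hockey.

```python
def pythagorean_win_pct(scored: float, allowed: float, exponent: float = 2.0) -> float:
    """Expected winning percentage: scored^e / (scored^e + allowed^e).

    `exponent` defaults to 2.0, the classic baseball value; this is an
    assumption, not a value taken from the abstract above.
    """
    if scored < 0 or allowed < 0 or scored + allowed == 0:
        raise ValueError("scored and allowed must be non-negative and not both zero")
    s = scored ** exponent
    a = allowed ** exponent
    return s / (s + a)


# Hypothetical team: 250 goals for, 200 goals against.
print(round(pythagorean_win_pct(250, 200), 4))  # 0.6098
```

A team that scores exactly as much as it allows gets 0.5 for any exponent, which is a quick sanity check on the implementation.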
First Order Approximations of the Pythagorean Won-Loss Formula for Predicting MLB Teams' Winning Percentages
We mathematically prove that an existing linear predictor of baseball teams'
winning percentages (Jones and Tappin 2005) is simply a first-order
approximation to Bill James' Pythagorean Won-Loss formula and can thus be
written in terms of the formula's well-known exponent. We estimate the linear
model on twenty seasons of Major League Baseball data and are able to verify
that the resulting coefficient estimate, with 95% confidence, is virtually
identical to the empirically accepted value of 1.82. Our work thus helps
explain why this simple and elegant model is such a strong linear predictor.
Comment: 7 pages, 1 Table, Appendix with Alternative Proof; By the Numbers 21,
201
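One way to see the relationship claimed in this abstract is a first-order Taylor expansion of the Pythagorean formula about the point where runs scored and runs allowed are equal. Here RS, RA and the exponent \gamma follow the usual notation for the formula; the expansion point R (a common run level, for example a league average) is a symbol introduced here for illustration, and this sketch is not necessarily the argument the paper itself gives.

```latex
\[
f(\mathrm{RS},\mathrm{RA})
  = \frac{\mathrm{RS}^{\gamma}}{\mathrm{RS}^{\gamma}+\mathrm{RA}^{\gamma}},
\qquad f(R,R)=\tfrac{1}{2},
\]
\[
\left.\frac{\partial f}{\partial \mathrm{RS}}\right|_{(R,R)}
  = \left.\frac{\gamma\,\mathrm{RS}^{\gamma-1}\,\mathrm{RA}^{\gamma}}
               {\left(\mathrm{RS}^{\gamma}+\mathrm{RA}^{\gamma}\right)^{2}}\right|_{(R,R)}
  = \frac{\gamma}{4R},
\qquad
\left.\frac{\partial f}{\partial \mathrm{RA}}\right|_{(R,R)}
  = -\frac{\gamma}{4R},
\]
\[
f(\mathrm{RS},\mathrm{RA}) \approx \frac{1}{2} + \frac{\gamma}{4R}\,(\mathrm{RS}-\mathrm{RA}).
\]
```

Under this expansion, a linear predictor of the form 0.5 + \beta(RS - RA) has slope \beta \approx \gamma/(4R), which is how a fitted linear coefficient can be translated back into an estimate of the exponent.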
Contributions to Bayesian Statistical Modeling in Public Policy Research
This dissertation improves existing Bayesian statistical methodologies and applies these improvements to a variety of important public policy questions. The manuscript is divided into six chapters. The first chapter provides an overview of the various chapters of the dissertation. The second chapter improves existing Bayesian binary logistic regression methodologies using polynomial expansions as an alternative to existing Markov Chain Monte Carlo (MCMC) methods. Our improvements make the estimation technique quite useful for a variety of applications. We also demonstrate the methodology to be considerably faster than existing MCMC methods. These computational gains are quite useful for models analyzing large data sets involving high-dimensional parameter spaces. We apply this methodology to a child poverty data set to analyze the potential causes of child poverty. The next chapter improves upon a well-known technique in semiparametric modeling known as density ratio estimation. This methodology is useful in principle; however, it suffers from one primary limitation: the technique has thus far been incapable of modeling individual-level heterogeneity. Modeling heterogeneity is important as there is often no a priori reason to believe that different individuals (or observations) in a data set will behave in an identical manner. We ameliorate this limitation in the third chapter of this dissertation by adapting density ratio estimation methods to accommodate individual-level heterogeneity. We apply this new methodology to an analysis of the efficacy of medical malpractice reform across the country. In the fourth chapter of this dissertation, we shift our focus toward improving Bayesian credible interval estimation via semiparametric density ratio estimation.
We do so by applying an innovative adaptation of the methodology, known as out of sample fusion, to posterior samples from a hierarchical Bayesian linear model looking at the efficacy of the welfare reform of the 1990s. In the fifth chapter, we extend the application of this methodology to credible interval estimation of a hierarchical generalized linear model used for analyzing terrorism data in a number of major conflicts across the globe. We use our results to offer some prescriptive policy suggestions regarding counterterrorism policy. The final chapter concludes the dissertation and offers a number of suggestions for further research. We emphasize that the modeling contributions presented in this dissertation are useful in myriad other applied problems beyond the public policy applications presented here.
Unfounded FUND: Yet Another EPA Model Not Ready for the Big Game
Using the OMB-mandated discount rate of 7 percent, the Climate Framework for Uncertainty, Negotiation and Distribution (FUND) model suggests an average social cost of carbon (SCC) of essentially zero dollars, suggesting no net economic damages of global warming.
Upon using the OMB-mandated discount rate in conjunction with updating the equilibrium climate sensitivity distribution, the model reduces its estimate of the SCC for 2020 by nearly $34 a ton (a drop of more than 102 percent).
The FUND model even allows negative estimates of the SCC. In some instances, the chance of the SCC's being negative is nearly 70 percent.
With such great sensitivity to assumptions producing results all over the map, the FUND model may remain an interesting academic exercise, but it is almost certainly not reliable enough to justify trillions of dollars' worth of additional economic regulations with which to burden the economy.
Applications of Improvements to the Pythagorean Won-Loss Expectation in Optimizing Rosters
Bill James' Pythagorean formula has for decades done an excellent job
estimating a baseball team's winning percentage from very little data: if the
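average runs scored and allowed are denoted respectively by $\mathrm{RS}$ and
$\mathrm{RA}$, there is some $\gamma$ such that the winning percentage is
approximately $\mathrm{RS}^\gamma / (\mathrm{RS}^\gamma + \mathrm{RA}^\gamma)$. One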
important consequence is to determine the value of different players to the
team, as it allows us to estimate how many more wins we would have given a
fixed increase in run production. We summarize earlier work on the subject, and
extend the earlier theoretical model of Miller (who estimated the run
distributions as arising from independent Weibull distributions with the same
shape parameter; this has been found to describe the observed run data
well). We now model runs scored and allowed as being drawn from independent
Weibull distributions where the shape parameter is not necessarily the same,
and then use the Method of Moments to solve a system of four equations in four
unknowns. Doing so yields a predicted winning percentage that is consistently
better than earlier models over the last 30 MLB seasons (1994 to 2023). This
comes at a small cost as we no longer have a closed form expression but must
evaluate a two-dimensional integral of two Weibull distributions and
numerically estimate the solutions to the system of equations; as these are
trivial to do with simple computational programs, it is well worth adopting this
framework and avoiding the issues of implementing the Method of Least Squares
or the Method of Maximum Likelihood.
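To make the numerical step concrete, here is a minimal sketch of the win probability P(X > Y) when runs scored X and runs allowed Y are independent Weibull variables with possibly different shape parameters. It is not the paper's procedure: it takes the Weibull parameters as given rather than fitting them by the Method of Moments, it omits any translation (shift) parameter, and it collapses the two-dimensional integral to a one-dimensional one by substituting u = F_X(x). The function and parameter names are hypothetical.

```python
import math


def weibull_win_probability(scale_scored: float, shape_scored: float,
                            scale_allowed: float, shape_allowed: float,
                            n: int = 20000) -> float:
    """Estimate P(X > Y) for independent Weibull X (scored) and Y (allowed).

    Uses P(X > Y) = integral over u in (0, 1) of F_Y(F_X^{-1}(u)) du,
    evaluated with the midpoint rule. No shift parameter is modelled
    (an assumption of this sketch).
    """
    total = 0.0
    for i in range(n):
        u = (i + 0.5) / n
        # Inverse Weibull CDF: x = scale * (-ln(1 - u))^(1 / shape)
        x = scale_scored * (-math.log1p(-u)) ** (1.0 / shape_scored)
        # Weibull CDF of Y evaluated at x
        total += 1.0 - math.exp(-((x / scale_allowed) ** shape_allowed))
    return total / n


# With equal shapes the integral has a closed form in the scale parameters,
# a^g / (a^g + b^g), which gives a check on the numerics.
a, b, g = 5.0, 4.0, 1.82
print(weibull_win_probability(a, g, b, g), a**g / (a**g + b**g))
```

When the two shape parameters differ, no such closed form is available, which is why the win probability has to be evaluated numerically.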
Closed-Form Bayesian Inferences for the Logit Model via Polynomial Expansions
Articles in Marketing and choice literatures have demonstrated the need for
incorporating person-level heterogeneity into behavioral models (e.g., logit
models for multiple binary outcomes as studied here). However, the logit
likelihood extended with a population distribution of heterogeneity doesn't
yield closed-form inferences, and therefore numerical integration techniques
are relied upon (e.g., MCMC methods).
We present here an alternative: closed-form Bayesian inferences for the logit
model, which we obtain by approximating the logit likelihood via a polynomial
expansion, and then positing a distribution of heterogeneity from a flexible
family that is now conjugate and integrable. For problems where the response
coefficients are independent, choosing the Gamma distribution leads to rapidly
convergent closed-form expansions; if there are correlations among the
coefficients one can still obtain rapidly convergent closed-form expansions by
positing a distribution of heterogeneity from a Multivariate Gamma
distribution. The solution then comes from the moment generating function of
the Multivariate Gamma distribution or in general from the multivariate
heterogeneity distribution assumed.
Closed-form Bayesian inferences, derivatives (useful for elasticity
calculations), population distribution parameter estimates (useful for
summarization) and starting values (useful for complicated algorithms) are
hence directly available. Two simulation studies demonstrate the efficacy of
our approach.
Comment: 30 pages, 2 figures, corrected some typos. Appears in Quantitative
Marketing and Economics vol 4 (2006), no. 2, 173--20
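The general idea in this abstract, expanding the logit term as a series and integrating it term by term against a Gamma heterogeneity distribution through its moment generating function, can be illustrated in a deliberately simplified setting. The sketch below is not the paper's expansion: it assumes a single positive coefficient with no covariates, uses the geometric series 1/(1 + e^{-b}) = sum over k of (-1)^k e^{-kb} (valid for b > 0), and applies the Gamma identity E[e^{-kb}] = (1 + k * scale)^(-shape). The function name and the truncation length are hypothetical choices.

```python
import math


def expected_logistic_gamma(shape: float, scale: float, n_terms: int = 2000) -> float:
    """Approximate E[1 / (1 + exp(-b))] for b ~ Gamma(shape, scale).

    Term-by-term integration of the geometric series gives
    sum_k (-1)^k * (1 + k * scale)^(-shape). The series alternates with
    decreasing terms, so the truncation error is at most the first
    omitted term; convergence is slow when `shape` is small.
    """
    total = 0.0
    for k in range(n_terms):
        sign = -1.0 if k % 2 else 1.0
        total += sign * (1.0 + k * scale) ** (-shape)
    return total


# Hypothetical heterogeneity distribution: shape 3, scale 0.5 (mean 1.5).
print(expected_logistic_gamma(3.0, 0.5))
```

No sampling is involved: the expectation is a deterministic sum, which is the appeal of a closed-form approach over simulation-based integration.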
Closed Form Bayesian Inferences for Binary Logistic Regression with Applications to American Voter Turnout
Understanding the factors that influence voter turnout is a fundamentally important question in public policy and political science research. Bayesian logistic regression models are useful for incorporating individual level heterogeneity to answer these and many other questions. When these questions involve incorporating individual level heterogeneity for large data sets that include many demographic and ethnic subgroups, however, standard Markov Chain Monte Carlo (MCMC) sampling methods to estimate such models can be quite slow and impractical to perform in a reasonable amount of time. We present an innovative closed form Empirical Bayesian approach that is significantly faster than MCMC methods, thus enabling the estimation of voter turnout models that had previously been considered computationally infeasible. Our results shed light on factors impacting voter turnout data in the 2000, 2004, and 2008 presidential elections. We conclude with a discussion of these factors and the associated policy implications. We emphasize, however, that although our application is to the social sciences, our approach is fully generalizable to myriad other fields involving statistical models with binary dependent variables and high-dimensional parameter spaces.