    Building an NCAA Men’s Basketball Predictive Model and Quantifying Its Success

    Computing and machine learning advancements have led to the creation of many cutting-edge predictive algorithms, some of which have been demonstrated to provide more accurate forecasts than traditional statistical tools. In this manuscript, we provide evidence that the combination of modest statistical methods with informative data can meet or exceed the accuracy of more complex models when it comes to predicting the NCAA men's basketball tournament. First, we describe a prediction model that merges the point spreads set by Las Vegas sportsbooks with possession-based team efficiency metrics by using logistic regressions. The set of probabilities generated from this model most accurately predicted the 2014 tournament, relative to approximately 400 competing submissions, as judged by the log loss function. Next, we attempt to quantify the degree to which luck played a role in the success of this model by simulating tournament outcomes under different sets of true underlying game probabilities. We estimate that under the most optimistic of game probability scenarios, our entry had roughly a 12% chance of outscoring all competing submissions and just less than a 50% chance of finishing with one of the ten best scores.
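
The log loss criterion mentioned in this abstract can be illustrated with a minimal sketch. The probabilities and outcomes below are hypothetical and are not taken from the paper; the function only shows how a set of predicted win probabilities would be scored, with lower values being better.

```python
import math

def log_loss(probs, outcomes, eps=1e-15):
    """Mean negative log-likelihood of binary outcomes under predicted probabilities."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        # Clip so that a prediction of exactly 0 or 1 does not produce log(0).
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)

# Hypothetical example: three games, 1 means the first-listed team won.
results = [1, 1, 0]
confident = [0.9, 0.8, 0.3]   # assumed forecasts from some model
coin_flip = [0.5, 0.5, 0.5]   # uninformative baseline

print(log_loss(confident, results))
print(log_loss(coin_flip, results))
```

An uninformative forecast of 0.5 for every game scores ln 2, so a submission is only useful to the extent that it scores below that baseline.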

    Filling in the Gaps: A Multiple Imputation Approach to Estimating Aging Curves in Baseball

    In sports, an aging curve depicts the relationship between average performance and age in athletes' careers. This paper investigates the aging curves for offensive players in Major League Baseball. We study this problem in a missing data context and account for different types of dropouts of baseball players during their careers. In particular, the performance metric associated with the missing seasons is imputed using a multiple imputation model for multilevel data, and the aging curves are constructed based on the imputed datasets. We first perform a simulation study to evaluate the effects of different dropout mechanisms on the estimation of aging curves. Our method is then illustrated with analyses of MLB player data from past seasons. Results suggest an overestimation of the aging curves constructed without imputing the unobserved seasons, whereas a better estimate is achieved with our approach.

    Bang the Can Slowly: An Investigation into the 2017 Houston Astros

    This manuscript is a statistical investigation into the 2017 Major League Baseball scandal involving the Houston Astros, the World Series championship winner that same year. The Astros were alleged to have stolen their opponents' pitching signs in order to provide their batters with a potentially unfair advantage. This work finds compelling evidence that the Astros' on-field performance was significantly affected by their sign-stealing ploy and quantifies the effects. The three main findings in the manuscript are: 1) the Astros' odds of swinging at a pitch were reduced by approximately 27% (OR: 0.725, 95% CI: (0.618, 0.850)) when the sign was stolen, 2) when an Astros player swung, the odds of making contact with the ball increased roughly 80% (OR: 1.805, 95% CI: (1.342, 2.675)) on non-fastball pitches, and 3) when the Astros made contact with a ball on a pitch in which the sign was known, the ball's exit velocity (launch speed) increased on average by 2.386 (95% CI: (0.334, 4.451)) miles per hour.
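
The percentages quoted in this abstract follow from the odds ratios by simple arithmetic, which the sketch below reproduces. The function names are illustrative, and the baseline probability of 0.5 is a hypothetical value chosen only to show that a change in odds is not the same as a change in probability.

```python
def percent_change_in_odds(odds_ratio):
    """Percent change in the odds implied by an odds ratio."""
    return (odds_ratio - 1.0) * 100.0

def apply_odds_ratio(baseline_prob, odds_ratio):
    """Probability obtained after multiplying the baseline odds by an odds ratio."""
    odds = baseline_prob / (1.0 - baseline_prob)
    new_odds = odds * odds_ratio
    return new_odds / (1.0 + new_odds)

# Odds ratios reported in the abstract.
print(percent_change_in_odds(0.725))  # swinging: about a 27% reduction in odds
print(percent_change_in_odds(1.805))  # contact: about an 80% increase in odds

# Hypothetical baseline swing probability of 0.5 (not from the paper).
print(apply_odds_ratio(0.5, 0.725))
```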

    MixMAP: An R Package for Mixed Modeling of Meta-Analysis p Values in Genetic Association Studies

    Genetic association studies are commonly conducted to identify genes that explain the variability in a measured trait (e.g., disease status or disease progression). Often, results of these studies are summarized in the form of a p value corresponding to a test of association between each single nucleotide polymorphism (SNP) and the trait under study. As genes are comprised of multiple SNPs, post hoc approaches are generally applied to determine gene-level association. For example, if any SNP within a gene is significantly associated with the trait at a genome-wide significance level (p < 5 × 10^-8), then the corresponding gene is considered significant. A complementary strategy, termed mixed modeling of meta-analysis p values (MixMAP), was proposed recently to characterize formally the associations between genes (or gene regions) and a trait based on multiple SNP-level p values. Here, the MixMAP package is presented as a means for implementing the MixMAP procedure in R.
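
The post hoc rule described in this abstract (a gene is called significant if any of its SNPs passes the genome-wide threshold) can be sketched as follows. This illustrates only that simple rule, not the MixMAP procedure itself, and the gene names and p values are hypothetical.

```python
GENOME_WIDE_THRESHOLD = 5e-8  # conventional genome-wide significance level

def significant_genes(snp_pvalues, threshold=GENOME_WIDE_THRESHOLD):
    """Return the genes with at least one SNP-level p value below the threshold.

    snp_pvalues maps a gene name to the list of p values of its SNPs.
    """
    return {
        gene
        for gene, pvals in snp_pvalues.items()
        if any(p < threshold for p in pvals)
    }

# Hypothetical SNP-level results for two genes.
example = {
    "GENE_A": [0.2, 3e-9, 0.04],      # one SNP passes the threshold
    "GENE_B": [1e-6, 2e-7, 5e-7],     # several small p values, none passes
}
print(significant_genes(example))
```

In this made-up example GENE_B is not flagged even though all of its SNPs show moderately small p values, which is the kind of gene-level evidence that a method pooling multiple SNP-level p values is meant to use.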

    Sure Independence Screening in the Presence of Data That is Missing at Random

    Variable selection in ultra-high dimensional data sets is an increasingly prevalent issue with the readily available data arising from, for example, genome-wide association studies or gene expression data. When the dimension of the feature space is exponentially larger than the sample size, it is desirable to screen out unimportant predictors in order to bring the dimension down to a moderate scale. In this paper we consider the case when observations of the predictors are missing at random. We propose performing screening using the marginal linear correlation coefficient between each predictor and the response variable, accounting for the missing data using maximum likelihood estimation. This method is shown to have the sure screening property. Moreover, a novel method of screening that uses additional predictors when estimating the correlation coefficient is proposed. Simulations show that simply performing screening using pairwise complete observations is outperformed by both the proposed methods and is not recommended. Finally, the proposed methods are applied to a gene expression study on prostate cancer.

    Reference database of teeth images from the Family Bovidae

    Researchers typically rely on fossils from the Family Bovidae to generate African paleoenvironmental reconstructions due to their strict ecological tendencies. Bovids have dominated the southern African fauna for the past four million years and, therefore, dominate the fossil faunal assemblages, especially isolated teeth. Traditionally, researchers reference modern and fossil comparative collections to identify teeth. However, researchers are limited by the specific type and number of bovids at each institution. B.O.V.I.D. (Bovidae Occlusal Visual IDentification) is a repository of images of the occlusal surface of bovid teeth. The dataset currently includes extant bovids from 7 tribes and 20 species (~3900). B.O.V.I.D. contains two scaled images per specimen: a color image and a black-and-white (binarized) image. The database is a useful reference for identifying bovid teeth. The large sample size also allows one to observe the natural variation that exists in each taxon. The binarized images can be used in statistical shape analyses, such as taxonomic classification. B.O.V.I.D. is a valuable supplement to other methods for taxonomically identifying bovid teeth.

    How Often Does the Best Team Win? A Unified Approach to Understanding Randomness in North American Sport

    Statistical applications in sports have long centered on how to best separate signal (e.g. team talent) from random noise. However, most of this work has concentrated on a single sport, and the development of meaningful cross-sport comparisons has been impeded by the difficulty of translating luck from one sport to another. In this manuscript, we develop Bayesian state-space models using betting market data that can be uniformly applied across sporting organizations to better understand the role of randomness in game outcomes. These models can be used to extract estimates of team strength, the between-season, within-season, and game-to-game variability of team strengths, as well as each team’s home advantage. We implement our approach across a decade of play in each of the National Football League (NFL), National Hockey League (NHL), National Basketball Association (NBA), and Major League Baseball (MLB), finding that the NBA demonstrates both the largest dispersion in talent and the largest home advantage, while the NHL and MLB stand out for their relative randomness in game outcomes. We conclude by proposing new metrics for judging competitiveness across sports leagues, both within the regular season and using traditional postseason tournament formats. Although we focus on sports, we discuss a number of other situations in which our generalizable models might be usefully applied.

    An Examination of Olympic Sport Climbing Competition Format and Scoring System

    Sport climbing, which made its Olympic debut at the 2020 Summer Games, generally consists of three separate disciplines: speed climbing, bouldering, and lead climbing. However, the International Olympic Committee (IOC) only allowed one set of medals each for men and women in sport climbing. As a result, the governing body of sport climbing, rather than choosing only one of the three disciplines to include in the Olympics, decided to create a competition combining all three disciplines. In order to determine a winner, a combined scoring system was created using the product of the ranks across the three disciplines to determine an overall score for each climber. In this work, the rank-product scoring system of sport climbing is evaluated through simulation to investigate its general features, specifically, the advancement probabilities and scores for climbers given certain placements. Additionally, analyses of historical climbing contest results are presented and real examples of violations of the independence of irrelevant alternatives are illustrated. Finally, this work finds evidence that the current competition format is putting speed climbers at a disadvantage. Comment: 17 pages, 7 figures
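
The rank-product score and the independence-of-irrelevant-alternatives issue mentioned in this abstract can be illustrated with a small sketch. The four climbers and their finishing orders are hypothetical, ties are ignored, and the helper names are illustrative rather than taken from the paper.

```python
from math import prod

def rank_product_scores(orderings):
    """Product of each climber's ranks across disciplines (lower is better).

    orderings maps a discipline name to a list of climbers, best first.
    """
    climbers = next(iter(orderings.values()))
    return {
        c: prod(order.index(c) + 1 for order in orderings.values())
        for c in climbers
    }

def without(orderings, climber):
    """Same finishing orders with one climber removed from the field."""
    return {d: [c for c in order if c != climber] for d, order in orderings.items()}

# Hypothetical finishing orders, best first.
field = {
    "speed":      ["A", "D", "B", "C"],
    "bouldering": ["C", "B", "A", "D"],
    "lead":       ["C", "B", "A", "D"],
}

full = rank_product_scores(field)
reduced = rank_product_scores(without(field, "D"))
print(full)
print(reduced)
```

With all four climbers, A (1 × 3 × 3 = 9) finishes ahead of B (3 × 2 × 2 = 12). If D, who has the worst overall score, is removed, B improves to 2 × 2 × 2 = 8 while A stays at 9, so the relative order of A and B flips even though neither changed performance.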

    Are Proxima and Alpha Centauri Gravitationally Bound?

    Using the most recent kinematic and radial velocity data in the literature, we calculate the binding energy of Proxima Centauri relative to the center of mass of the Alpha Centauri system. When we adopt the centroids of the observed data, we find that the three stars constitute a bound system, albeit with a semi-major axis that is of the same order as Alpha Centauri AB's Hill radius in the galactic potential. We carry out a Monte Carlo simulation under the assumption that the errors in the observed quantities are uncorrelated. In this simulation, 44% of the trial systems are bound, and systems on the 1-3 sigma tail of the radial velocity distribution can have Proxima currently located near the apastron position of its orbit. Our analysis shows that a further, very significant improvement in the characterization of the system can be gained by obtaining a more accurate measurement of the radial velocity of Proxima Centauri. Comment: 10 pages total, 4 pages of text, 1 page of references, 3 figures, and 2 tables. This article will be published in The Astronomical Journal