
    A Comprehensive Review of the Two-Sample Independent or Paired Binary Data, with or without Stratum Effects

    Various statistical hypothesis tests for discrete, categorical, or binary data have been extensively discussed in the literature. A comprehensive review is given of two-sample binary or categorical data testing methods, on data with or without stratum effects. The review includes traditional methods such as Fisher's exact, Pearson's chi-square, McNemar, Bowker, Stuart-Maxwell, Breslow-Day, and Cochran-Mantel-Haenszel tests, as well as newly developed ones. We also provide a roadmap, in figure or diagram format, of which methods are available in the literature. In addition, the implementation of these methods in popular statistical software packages such as SAS and/or R is presented. This will help researchers determine which categorical-data testing method to use in various fields of study such as clinical trials and epidemiology, both in the design phase of a prospective study and in the analysis of cross-sectional or retrospective studies.
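    Two of the traditional tests named above are also available in general-purpose scientific software. As a minimal sketch (in Python rather than the SAS/R implementations discussed in the review; the 2x2 table below is hypothetical, not from any study):

```python
from scipy.stats import fisher_exact, chi2_contingency

# Hypothetical 2x2 table: rows are the two groups,
# columns are success/failure counts.
table = [[12, 5],
         [7, 15]]

# Fisher's exact test, suitable for small samples
odds_ratio, p_fisher = fisher_exact(table)

# Pearson's chi-square test, the large-sample alternative
chi2, p_chi2, dof, expected = chi2_contingency(table)

print(p_fisher, p_chi2)
```

Stratified (Cochran-Mantel-Haenszel) and paired (McNemar) analogues follow the same pattern with a table per stratum or a table of discordant pairs.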

    Common Statistical Pitfalls in Basic Science Research

    In this review, we focused on common sources of confusion and errors in the analysis and interpretation of basic science studies. The issues addressed are seen repeatedly in the authors' editorial experience, and we hope this article will serve as a guide for those who may submit their basic science studies to journals that publish both clinical and basic science research. We have discussed issues related to sample size and power, study design, data analysis, and presentation of results. We then illustrated these issues using a set of examples from basic science research studies.
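    One of the sample-size-and-power issues such reviews raise can be made concrete with the standard normal-approximation formula for a two-sample comparison of means. This is a generic sketch, not taken from the article; the effect size, alpha, and power values below are hypothetical:

```python
from scipy.stats import norm

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate per-group n to detect a mean difference delta
    with common SD sigma, using the normal-approximation formula
    n = 2 * ((z_{alpha/2} + z_{power}) * sigma / delta)^2."""
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    return 2 * ((z_a + z_b) * sigma / delta) ** 2

# Hypothetical design: detect a half-SD difference at 80% power
print(n_per_group(delta=0.5, sigma=1.0))
```

The result (about 63 per group) is a lower bound; exact t-based calculations give slightly larger n.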

    A statistical framework for testing functional categories in microarray data

    Ready access to emerging databases of gene annotation and functional pathways has shifted assessments of differential expression in DNA microarray studies from single genes to groups of genes with shared biological function. This paper takes a critical look at existing methods for assessing the differential expression of a group of genes (functional category), and provides some suggestions for improved performance. We begin by presenting a general framework, in which the set of genes in a functional category is compared to the complementary set of genes on the array. The framework includes tests for overrepresentation of a category within a list of significant genes, and methods that consider continuous measures of differential expression. Existing tests are divided into two classes. Class 1 tests assume gene-specific measures of differential expression are independent, despite overwhelming evidence of positive correlation. Analytic and simulated results are presented that demonstrate Class 1 tests are strongly anti-conservative in practice. Class 2 tests account for gene correlation, typically through array permutation that by construction has proper Type I error control for the induced null. However, both Class 1 and Class 2 tests use a null hypothesis that all genes have the same degree of differential expression. We introduce a more sensible and general (Class 3) null under which the profile of differential expression is the same within the category and complement. Under this broader null, Class 2 tests are shown to be conservative. We propose standard bootstrap methods for testing against the Class 3 null and demonstrate they provide valid Type I error control and more power than array permutation in simulated datasets and real microarray experiments. Published at http://dx.doi.org/10.1214/07-AOAS146 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org/).
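    The "Class 1" style of test described above can be sketched very simply: compare gene-level differential-expression scores inside a functional category against the complement with a two-sample t-test. As the paper argues, this treats gene scores as independent and becomes anti-conservative when genes within a category are positively correlated. The data below are simulated purely for illustration:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
scores = rng.normal(size=1000)      # per-gene differential-expression scores
in_category = np.zeros(1000, dtype=bool)
in_category[:50] = True             # hypothetical 50-gene functional category

# Class 1 test: category scores vs complement scores, assuming independence
t_stat, p_value = ttest_ind(scores[in_category], scores[~in_category])
print(p_value)
```

Class 2 (array-permutation) and the proposed bootstrap methods replace the t-distribution reference with resampling that preserves the gene correlation structure.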

    Marginal methods and software for clustered data with cluster- and group-size informativeness.

    Clustered data result when observations have some natural organizational association. In such data, cluster size is defined as the number of observations belonging to a cluster. A phenomenon termed informative cluster size (ICS) occurs when observation outcomes vary in a systematic way related to the cluster size. An additional form of informativeness, termed informative within-cluster group size (IWCGS), arises when the distribution of group-defining categorical covariates within clusters similarly carries information related to outcomes. Standard methods for the marginal analysis of clustered data can produce biased estimates and inference when data have informativeness. A reweighting methodology has been developed that is resistant to ICS and IWCGS bias, and this method has been used to establish clustered data analogs of classical hypothesis tests related to ranks and correlation. In this work, we extend the reweighting methodology to develop a versatile collection of marginal hypothesis tests related to proportions, means, and variances in clustered data that are analogous to classical forms. We evaluate the performance of these tests compared to other cluster-appropriate methods through simulation and show that only reweighted tests maintain appropriate size when data have informativeness. We construct reweighted tests of clustered categorical data using several variance estimators, and demonstrate that the method of variance estimation can have substantial effect on these tests. Additionally, we show that when testing simple hypotheses in data lacking informativeness, reweighted tests can outperform other standard cluster-appropriate methods both in terms of size and power. Combining our novel tests with the existing tests of ranks and correlations, we compile a comprehensive R software package that executes this collection of ICS/IWCGS-appropriate methods through a thoughtful and user-friendly design.
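    The core of the reweighting idea can be sketched in a few lines: weight each observation by the inverse of its cluster's size, so every cluster contributes equally to the marginal estimate regardless of how many observations it contains. The toy clusters below are hypothetical and chosen so that cluster size is informative (the large cluster has mostly successes):

```python
import numpy as np

clusters = {                      # cluster id -> binary outcomes
    "A": [1, 1, 1, 1, 1, 0],      # large cluster, mostly successes
    "B": [0],                     # small clusters, failures
    "C": [0],
}

# Naive pooled proportion: dominated by the large cluster
all_obs = [y for ys in clusters.values() for y in ys]
naive_p = np.mean(all_obs)

# Inverse-cluster-size reweighted proportion: the average of
# within-cluster means, giving each cluster equal weight
reweighted_p = np.mean([np.mean(ys) for ys in clusters.values()])

print(naive_p, reweighted_p)  # 0.625 vs about 0.278
```

The reweighted tests in the paper apply this same weighting inside the test statistics for proportions, means, and variances.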

    Reduction of Compound Lotteries with Objective Probabilities: Theory and Evidence

    The reduction of compound lotteries (ROCL) has assumed a central role in the evaluation of behavior towards risk and uncertainty. We present experimental evidence on its validity in the domain of objective probabilities. Our experiment explicitly recognizes the impact that the random lottery incentive mechanism payment procedure may have on preferences, and so we collect data using both "1-in-1" and "1-in-K" payment procedures, where K > 1. We do not find violations of ROCL when subjects are presented with only one choice that is played for money. However, when individuals are presented with many choices and the random lottery incentive mechanism is used to select one choice for payoff, we do find violations of ROCL. These results are supported by both non-parametric analysis of choice patterns and structural estimation of latent preferences. We find evidence that the model that best describes behavior when subjects make only one choice is the Rank-Dependent Utility model. When subjects face many choices, their behavior is better characterized by our source-dependent version of the Rank-Dependent Utility model, which can account for violations of ROCL. We conclude that payment protocols can create distortions in experimental tests of basic axioms of decision theory.
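    The Rank-Dependent Utility evaluation referenced above can be sketched directly: outcomes are ranked, and decision weights come from a probability weighting function applied to cumulative probabilities. The power-form utility and weighting functions below are common textbook choices, not the paper's estimated specification, and the lottery is hypothetical:

```python
def rdu(outcomes, probs,
        utility=lambda x: x ** 0.5,   # hypothetical CRRA-style utility
        w=lambda p: p ** 0.7):        # hypothetical probability weighting
    """Rank-Dependent Utility of a simple lottery."""
    # Rank outcomes from best to worst
    pairs = sorted(zip(outcomes, probs), key=lambda t: -t[0])
    value, cum = 0.0, 0.0
    for x, p in pairs:
        # decision weight = w(P(outcome at least this good))
        #                 - w(P(outcome strictly better))
        weight = w(cum + p) - w(cum)
        value += weight * utility(x)
        cum += p
    return value

# Hypothetical 50/50 lottery over 100 and 0
print(rdu([100, 0], [0.5, 0.5]))
```

With w(p) = p the decision weights collapse to the raw probabilities and RDU reduces to Expected Utility; ROCL tests ask whether compound lotteries are evaluated via their reduced simple-lottery form.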

    Techniques for handling clustered binary data

    Bibliography: leaves 143-153. Over the past few decades there has been increasing interest in clustered studies, and hence much research has gone into the analysis of data arising from these studies. It is erroneous to treat clustered data, where observations within a cluster are correlated with each other, as one would treat independent data. It has been found that point estimates are not as greatly affected by clustering as are the standard deviations of the estimates; as a consequence, however, confidence intervals and hypothesis tests are severely affected. Therefore one has to approach the analysis of clustered data with caution. Methods that specifically deal with correlated data have been developed. Analysis may be further complicated when the outcome variable of interest is binary rather than continuous. Methods for estimation of proportions, their variances, calculation of confidence intervals, and a variety of techniques for testing the homogeneity of proportions have been developed over the years (Donner and Klar, 1993; Donner, 1989; Rao and Scott, 1992). The methods developed within the context of experimental design generally involve incorporating the effect of clustering in the analysis. This cluster effect is quantified by the intracluster correlation and needs to be taken into account when estimating proportions, comparing proportions, and in sample size calculations. In the context of observational studies, the effect of clustering is expressed by the design effect, which is the inflation in the variance of an estimate that is due to selecting a cluster sample rather than an independent sample. Another important aspect of the analysis of complex sample data that is often neglected is sampling weights. One needs to recognise that each individual may not have the same probability of being selected; these weights adjust for this fact (Little et al., 1997). Methods for modelling correlated binary data have also been discussed quite extensively.
Among the many models which have been proposed for analyzing binary clustered data are two approaches which have been studied and compared: the population-averaged and the cluster-specific approach. The population-averaged model focuses on estimating the effect of a set of covariates on the marginal expectation of the response. One example of the population-averaged approach for parameter estimation is generalized estimating equations, proposed by Liang and Zeger (1986). It involves assuming that elements within a cluster are independent and then imposing a correlation structure on the set of responses. This is a useful application in longitudinal studies, where a subject is regarded as a cluster; the parameters then describe how the population-averaged response, rather than a specific subject's response, depends on the covariates of interest. On the other hand, cluster-specific models introduce cluster-to-cluster variability in the model by including random effects terms, which are specific to the cluster, as linear predictors in the regression model (Neuhaus et al., 1991). Unlike the special case of correlated Gaussian responses, the parameters for the cluster-specific model obtained for binary data describe different effects on the responses compared to those obtained from the population-averaged model. For longitudinal data, the parameters of a cluster-specific model describe how a specific individual's probability of a response depends on the covariates. The decision to use either of these modelling methods depends on the questions of interest. Cluster-specific models are useful for studying the effects of cluster-varying covariates and when an individual's response, rather than the population-averaged response, is the focus. The population-averaged model is useful when interest lies in how the average response across clusters changes with covariates. A criticism of this approach is that there may be no individual with the characteristics described by the population-averaged model.
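    The difference between the two kinds of parameter can be seen in a small simulation: under a random-intercept logistic model, the population-averaged (marginal) covariate effect is attenuated relative to the cluster-specific one. All parameter values below are hypothetical:

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
b = rng.normal(scale=2.0, size=100_000)   # cluster random intercepts
beta = 1.0                                # cluster-specific log odds ratio

# Marginal success probabilities at x = 0 and x = 1,
# averaging over the cluster random effects
p0 = expit(0.0 + b).mean()
p1 = expit(beta + b).mean()

# The population-averaged log odds ratio is smaller than beta
marginal_log_or = np.log(p1 / (1 - p1)) - np.log(p0 / (1 - p0))
print(beta, marginal_log_or)
```

This attenuation is exactly why, for binary outcomes, a GEE (population-averaged) coefficient and a random-effects (cluster-specific) coefficient answer different questions.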

    Change-point analysis of paired allele-specific copy number variation data

    The recent genome-wide allele-specific copy number variation data enable us to explore two types of genomic information: chromosomal genotype variations and DNA copy number variations. For a cancer study, it is common to collect data for paired normal and tumor samples, so two types of paired data can be obtained to study a disease subject. However, there is a lack of methods for a simultaneous analysis of these four sequences of data. In this study, we propose a statistical framework based on the change-point analysis approach. The validity and usefulness of our proposed statistical framework are demonstrated through simulation studies and applications based on an experimental data set.
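    The basic change-point machinery underlying such a framework can be sketched on a single sequence: scan all split points and maximize a standardized difference-in-means (CUSUM-type) statistic. The paper analyzes four paired allele-specific sequences jointly; the simulated single sequence below is only an illustration of the core idea:

```python
import numpy as np

rng = np.random.default_rng(7)
# Simulated sequence with one mean shift at position 100
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(2, 1, 100)])

n = len(x)
best_k, best_stat = None, -np.inf
for k in range(10, n - 10):          # avoid tiny segments at the edges
    left, right = x[:k], x[k:]
    # standardized difference in segment means at split point k
    stat = abs(left.mean() - right.mean()) * np.sqrt(k * (n - k) / n)
    if stat > best_stat:
        best_k, best_stat = k, stat

print(best_k)  # estimated change point, near the true location of 100
```

Extending this to paired normal/tumor, allele-specific data means maximizing a statistic that combines the evidence across all four sequences at each candidate split.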

    Does published orthodontic research account for clustering effects during statistical data analysis?

    In orthodontics, multiple site observations within patients or multiple observations collected at consecutive time points are often encountered. Clustered designs require larger sample sizes than individually randomized trials, and special statistical analyses that account for the fact that observations within clusters are correlated. The purpose of this study is to assess to what degree clustering effects are considered during design and data analysis in the three major orthodontic journals. The contents of the most recent 24 issues of the American Journal of Orthodontics and Dentofacial Orthopedics (AJODO), Angle Orthodontist (AO), and European Journal of Orthodontics (EJO) from December 2010 backwards were hand searched. Articles with clustering effects, and whether the authors accounted for clustering effects, were identified. Additionally, information was collected on: involvement of a statistician, single or multicenter study, number of authors in the publication, geographical area, and statistical significance. From the 1584 articles, after exclusions, 1062 were assessed for clustering effects, of which 250 (23.5 per cent) were considered to have clustering effects in the design (kappa = 0.92, 95 per cent CI: 0.67-0.99 for inter-rater agreement). Of the studies with clustering effects, only 63 (25.2 per cent) indicated accounting for clustering effects. There was evidence that studies published in the AO have higher odds of accounting for clustering effects [AO versus AJODO: odds ratio (OR) = 2.17, 95 per cent confidence interval (CI): 1.06-4.43, P = 0.03; EJO versus AJODO: OR = 1.90, 95 per cent CI: 0.84-4.24, non-significant; EJO versus AO: OR = 1.15, 95 per cent CI: 0.57-2.33, non-significant]. The results of this study indicate that only about a quarter of the studies with clustering effects account for this in the statistical data analysis.
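    The journal comparisons above are odds ratios with Wald confidence intervals from 2x2 tables of journal by "accounted for clustering" (yes/no). A minimal sketch of that calculation follows; the counts are hypothetical, not the study's data:

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """OR and Wald 95% CI for a 2x2 table:
    a, b = yes/no counts in journal 1; c, d = yes/no counts in journal 2."""
    or_ = (a * d) / (b * c)
    se_log = math.sqrt(1/a + 1/b + 1/c + 1/d)  # SE of log(OR)
    lo = math.exp(math.log(or_) - z * se_log)
    hi = math.exp(math.log(or_) + z * se_log)
    return or_, lo, hi

# Hypothetical counts: 30/70 accounted/not in one journal, 15/85 in another
print(odds_ratio_ci(30, 70, 15, 85))
```

An interval that excludes 1 (as in the AO versus AJODO comparison above) indicates a statistically significant difference in the odds of accounting for clustering.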

    Advanced Data Analysis - Lecture Notes

    Lecture notes for Advanced Data Analysis (ADA1 Stat 427/527 and ADA2 Stat 428/528), Department of Mathematics and Statistics, University of New Mexico, Fall 2016-Spring 2017. Additional material including RMarkdown templates for in-class and homework exercises, datasets, R code, and video lectures are available on the course websites: https://statacumen.com/teaching/ada1 and https://statacumen.com/teaching/ada2 .
    Contents:
    I ADA1, Software: 0 Introduction to R, Rstudio, and ggplot
    II ADA1, Summaries and displays, and one-, two-, and many-way tests of means: 1 Summarizing and Displaying Data; 2 Estimation in One-Sample Problems; 3 Two-Sample Inferences; 4 Checking Assumptions; 5 One-Way Analysis of Variance
    III ADA1, Nonparametric, categorical, and regression methods: 6 Nonparametric Methods; 7 Categorical Data Analysis; 8 Correlation and Regression
    IV ADA1, Additional topics: 9 Introduction to the Bootstrap; 10 Power and Sample Size; 11 Data Cleaning
    V ADA2, Review of ADA1: 1 R statistical software and review
    VI ADA2, Introduction to multiple regression and model selection: 2 Introduction to Multiple Linear Regression; 3 A Taste of Model Selection for Multiple Regression
    VII ADA2, Experimental design and observational studies: 4 One Factor Designs and Extensions; 5 Paired Experiments and Randomized Block Experiments; 6 A Short Discussion of Observational Studies
    VIII ADA2, ANCOVA and logistic regression: 7 Analysis of Covariance: Comparing Regression Lines; 8 Polynomial Regression; 9 Discussion of Response Models with Factors and Predictors; 10 Automated Model Selection for Multiple Regression; 11 Logistic Regression
    IX ADA2, Multivariate Methods: 12 An Introduction to Multivariate Methods; 13 Principal Component Analysis; 14 Cluster Analysis; 15 Multivariate Analysis of Variance; 16 Discriminant Analysis; 17 Classification