    Feature Screening via Distance Correlation Learning

    This paper is concerned with screening features in ultrahigh dimensional data analysis, which has become increasingly important in diverse scientific fields. We develop a sure independence screening procedure based on the distance correlation (DC-SIS, for short). The DC-SIS can be implemented as easily as the sure independence screening procedure based on the Pearson correlation (SIS, for short) proposed by Fan and Lv (2008). However, the DC-SIS can significantly improve on the SIS. Fan and Lv (2008) established the sure screening property for the SIS based on linear models, but the sure screening property is valid for the DC-SIS under more general settings including linear models. Furthermore, the implementation of the DC-SIS does not require model specification (e.g., linear model or generalized linear model) for responses or predictors. This is a very appealing property in ultrahigh dimensional data analysis. Moreover, the DC-SIS can be used directly to screen grouped predictor variables and for multivariate response variables. We establish the sure screening property for the DC-SIS, and conduct simulations to examine its finite sample performance. Numerical comparison indicates that the DC-SIS performs much better than the SIS in various models. We also illustrate the DC-SIS through a real data example. Comment: 32 pages, 5 tables and 1 figure. Wei Zhong is the corresponding author.
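To make the screening idea in this abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes the standard sample distance correlation (double-centered pairwise distance matrices) and ranks predictors by that score; the function names `distance_correlation` and `dc_sis` and the cutoff `d` are hypothetical choices made for illustration.

```python
import numpy as np


def _double_centered_distances(v):
    """Pairwise Euclidean distance matrix of v, double-centered."""
    v = np.asarray(v, dtype=float)
    if v.ndim == 1:
        v = v[:, None]  # treat a vector as n observations of one variable
    diff = v[:, None, :] - v[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    row = d.mean(axis=1, keepdims=True)
    col = d.mean(axis=0, keepdims=True)
    return d - row - col + d.mean()


def distance_correlation(x, y):
    """Sample distance correlation between x and y (each may be multivariate)."""
    a = _double_centered_distances(x)
    b = _double_centered_distances(y)
    dcov2 = (a * b).mean()
    denom = np.sqrt((a * a).mean() * (b * b).mean())
    if denom == 0.0:
        return 0.0  # a constant variable carries no dependence information
    return float(np.sqrt(max(dcov2, 0.0) / denom))


def dc_sis(X, y, d):
    """Return indices of the d predictors with the largest distance correlation with y."""
    scores = np.array([distance_correlation(X[:, j], y) for j in range(X.shape[1])])
    selected = np.argsort(scores)[::-1][:d]
    return selected, scores
```

Because the score is computed from distances rather than from a fitted model, the same function accepts a multivariate `y` or a block of columns as `x`, which is how grouped predictors or multivariate responses could be screened in this sketch.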

    Efficient inference for genetic association studies with multiple outcomes

    Combined inference for heterogeneous high-dimensional data is critical in modern biology, where clinical and various kinds of molecular data may be available from a single study. Classical genetic association studies regress a single clinical outcome on many genetic variants one by one, but there is an increasing demand for joint analysis of many molecular outcomes and genetic variants in order to unravel functional interactions. Unfortunately, most existing approaches to joint modelling are either too simplistic to be powerful or are impracticable for computational reasons. Inspired by Richardson et al. (2010, Bayesian Statistics 9), we consider a sparse multivariate regression model that allows simultaneous selection of predictors and associated responses. As Markov chain Monte Carlo (MCMC) inference on such models can be prohibitively slow when the number of genetic variants exceeds a few thousand, we propose a variational inference approach which produces posterior information very close to that of MCMC inference, at a much reduced computational cost. Extensive numerical experiments show that our approach outperforms popular variable selection methods and tailored Bayesian procedures, dealing within hours with problems involving hundreds of thousands of genetic variants and tens to hundreds of clinical or molecular outcomes.

    Karl Pearson's meta-analysis revisited

    This paper revisits a meta-analysis method proposed by Pearson [Biometrika 26 (1934) 425--442] and first used by David [Biometrika 26 (1934) 1--11]. It was thought to be inadmissible for over fifty years, dating back to a paper of Birnbaum [J. Amer. Statist. Assoc. 49 (1954) 559--574]. It turns out that the method Birnbaum analyzed is not the one that Pearson proposed. We show that Pearson's proposal is admissible. Because it is admissible, it has better power than the standard test of Fisher [Statistical Methods for Research Workers (1932) Oliver and Boyd] at some alternatives, and worse power at others. Pearson's method has the advantage when all or most of the nonzero parameters share the same sign. Pearson's test has proved useful in a genomic setting, screening for age-related genes. This paper also presents an FFT-based method for getting hard upper and lower bounds on the CDF of a sum of nonnegative random variables. Comment: Published in the Annals of Statistics (http://www.imstat.org/aos/) at http://dx.doi.org/10.1214/09-AOS697 by the Institute of Mathematical Statistics (http://www.imstat.org
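The contrast between the two combination tests can be illustrated with a small sketch. It rests on an assumption that the abstract does not spell out: that Pearson's proposal amounts to computing Fisher's statistic once from left-sided p-values and once from right-sided p-values and taking the larger of the two. The p-value returned for that statistic is only a union (Bonferroni) upper bound, not an exact value, and the names `chi2_sf_even` and `combine_tests` are hypothetical.

```python
import math


def chi2_sf_even(q, m):
    """P(chi-square with 2*m degrees of freedom >= q), closed form for even df."""
    half = q / 2.0
    term, total = 1.0, 1.0
    for k in range(1, m):
        term *= half / k
        total += term
    return math.exp(-half) * total


def normal_cdf(z):
    """Standard normal CDF via the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def fisher_statistic(pvalues):
    """Fisher's combination statistic -2 * sum(log p_i)."""
    return -2.0 * sum(math.log(p) for p in pvalues)


def combine_tests(zscores):
    """Combine independent z-scores (moderate |z| assumed, so no p-value underflows to 0)."""
    m = len(zscores)
    p_left = [normal_cdf(z) for z in zscores]    # small when z is very negative
    p_right = [normal_cdf(-z) for z in zscores]  # small when z is very positive
    p_two = [2.0 * min(l, r) for l, r in zip(p_left, p_right)]

    q_fisher = fisher_statistic(p_two)  # Fisher's test on two-sided p-values
    # Assumed form of the sign-sensitive statistic: larger of the one-sided versions.
    q_concordant = max(fisher_statistic(p_left), fisher_statistic(p_right))

    return {
        "fisher_p": chi2_sf_even(q_fisher, m),
        # Union bound: P(max(Q_L, Q_R) >= q) <= 2 * P(chi2_{2m} >= q) under the null.
        "concordant_p_bound": min(1.0, 2.0 * chi2_sf_even(q_concordant, m)),
    }
```

With z-scores such as `[2, 2, 2, 2]` the sign-sensitive bound is far smaller than with `[2, -2, 2, -2]`, while the two-sided Fisher p-value is identical in both cases, which matches the abstract's remark that the advantage appears when effects share the same sign.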

    A two-phase approach for detecting recombination in nucleotide sequences

    Genetic recombination can produce heterogeneous phylogenetic histories within a set of homologous genes. Delineating recombination events is important in the study of molecular evolution, as inference of such events provides a clearer picture of the phylogenetic relationships among different gene sequences or genomes. Nevertheless, detecting recombination events can be a daunting task, as the performance of different recombination-detecting approaches can vary, depending on evolutionary events that take place after recombination. We recently evaluated the effects of post-recombination events on the prediction accuracy of recombination-detecting approaches using simulated nucleotide sequence data. The main conclusion, supported by other studies, is that one should not depend on a single method when searching for recombination events. In this paper, we introduce a two-phase strategy, applying three statistical measures to detect the occurrence of recombination events, and a Bayesian phylogenetic approach in delineating breakpoints of such events in nucleotide sequences. We evaluate the performance of these approaches using simulated data, and demonstrate the applicability of this strategy to empirical data. The two-phase strategy proves to be time-efficient when applied to large datasets, and yields high-confidence results. Comment: 5 pages, 3 figures. Chan CX, Beiko RG and Ragan MA (2007). A two-phase approach for detecting recombination in nucleotide sequences. In Hazelhurst S and Ramsay M (Eds) Proceedings of the First Southern African Bioinformatics Workshop, 28-30 January, Johannesburg, 9-1

    Simulation-Based Hypothesis Testing of High Dimensional Means Under Covariance Heterogeneity

    In this paper, we study the problem of testing the mean vectors of high dimensional data in both one-sample and two-sample cases. The proposed testing procedures employ maximum-type statistics and parametric bootstrap techniques to compute the critical values. Unlike existing tests that rely heavily on structural conditions on the unknown covariance matrices, the proposed tests allow general covariance structures of the data and therefore enjoy a wide scope of applicability in practice. To enhance the power of the tests against sparse alternatives, we further propose two-step procedures with a preliminary feature screening step. Theoretical properties of the proposed tests are investigated. Through extensive numerical experiments on synthetic datasets and a human acute lymphoblastic leukemia gene expression dataset, we illustrate the performance of the new tests and how they may assist in detecting disease-associated gene-sets. The proposed methods have been implemented in an R-package HDtest and are available on CRAN. Comment: 34 pages, 10 figures; Accepted for biometric
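The one-sample procedure described in this abstract can be sketched roughly as follows. This is an illustrative Python sketch under stated assumptions, not the HDtest implementation: it assumes a studentized maximum statistic for the null hypothesis of a zero mean vector and draws bootstrap samples from a Gaussian whose covariance is the sample correlation matrix. The function name `max_type_one_sample_test` and its parameters are hypothetical.

```python
import numpy as np


def max_type_one_sample_test(X, n_boot=1000, alpha=0.05, seed=0):
    """Max-type test of H0: mean vector = 0, with a Gaussian bootstrap critical value."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    rng = np.random.default_rng(seed)

    mean = X.mean(axis=0)
    sd = X.std(axis=0, ddof=1)
    # Studentized maximum statistic over the p coordinates.
    stat = np.sqrt(n) * np.max(np.abs(mean) / sd)

    # Rows of Z are centered, scaled observations, so Z.T @ Z / (n - 1) is the
    # sample correlation matrix R_hat. For e ~ N(0, I_n), e @ Z / sqrt(n - 1)
    # is a draw from N(0, R_hat) without forming the p x p matrix explicitly.
    Z = (X - mean) / sd
    E = rng.standard_normal((n_boot, n))
    draws = E @ Z / np.sqrt(n - 1)
    boot_max = np.abs(draws).max(axis=1)

    critical_value = np.quantile(boot_max, 1.0 - alpha)
    p_value = (1 + np.sum(boot_max >= stat)) / (1 + n_boot)
    return stat, critical_value, p_value
```

Sampling through the data matrix avoids factorizing a covariance estimate that is singular when the dimension exceeds the sample size; no structural condition on the covariance is imposed in this sketch.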