Feature Screening via Distance Correlation Learning
This paper is concerned with screening features in ultrahigh dimensional data
analysis, which has become increasingly important in diverse scientific fields.
We develop a sure independence screening procedure based on the distance
correlation (DC-SIS, for short). The DC-SIS can be implemented as easily as the
sure independence screening procedure based on the Pearson correlation (SIS,
for short) proposed by Fan and Lv (2008). However, the DC-SIS can significantly
improve the SIS. Fan and Lv (2008) established the sure screening property for
the SIS based on linear models, but the sure screening property is valid for
the DC-SIS under more general settings including linear models. Furthermore,
the implementation of the DC-SIS does not require model specification (e.g.,
linear model or generalized linear model) for responses or predictors. This is
a very appealing property in ultrahigh dimensional data analysis. Moreover, the
DC-SIS can be used directly to screen grouped predictor variables and for
multivariate response variables. We establish the sure screening property for
the DC-SIS, and conduct simulations to examine its finite sample performance.
Numerical comparison indicates that the DC-SIS performs much better than the
SIS in various models. We also illustrate the DC-SIS through a real data
example. Comment: 32 pages, 5 tables and 1 figure. Wei Zhong is the corresponding
autho
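
The screening idea in this abstract can be illustrated with a minimal sketch: compute the sample distance correlation between each predictor and the response, then keep the top-ranked predictors. The function names (`distance_correlation`, `dc_sis`) and the cutoff parameter `d` are assumptions made for illustration, not names from the paper, and the code handles only univariate predictors and responses using the usual double-centered distance formula.

```python
import math


def _centered_distances(values):
    """Double-centered matrix of pairwise absolute differences."""
    n = len(values)
    dist = [[abs(a - b) for b in values] for a in values]
    row_means = [sum(row) / n for row in dist]
    grand_mean = sum(row_means) / n
    # The distance matrix is symmetric, so column means equal row means.
    return [
        [dist[i][j] - row_means[i] - row_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def distance_correlation(x, y):
    """Sample distance correlation between two equal-length 1-D samples."""
    n = len(x)
    a = _centered_distances(x)
    b = _centered_distances(y)
    dcov2 = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n ** 2
    dvar2_x = sum(v * v for row in a for v in row) / n ** 2
    dvar2_y = sum(v * v for row in b for v in row) / n ** 2
    denom = math.sqrt(dvar2_x * dvar2_y)
    if denom == 0.0:
        return 0.0  # a constant sample carries no dependence information
    return math.sqrt(max(dcov2, 0.0) / denom)


def dc_sis(features, response, d):
    """Indices of the d predictors with the largest distance correlation.

    `features` is a list of columns (one list of n values per predictor);
    `d` is a user-chosen cutoff (hypothetical parameter name).
    """
    scores = [distance_correlation(col, response) for col in features]
    ranked = sorted(range(len(features)), key=lambda j: scores[j], reverse=True)
    return ranked[:d]
```

Because the ranking uses only pairwise distances, no regression model is fitted at any point, which is the model-free property the abstract emphasizes.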
Efficient inference for genetic association studies with multiple outcomes
Combined inference for heterogeneous high-dimensional data is critical in
modern biology, where clinical and various kinds of molecular data may be
available from a single study. Classical genetic association studies regress a
single clinical outcome on many genetic variants one by one, but there is an
increasing demand for joint analysis of many molecular outcomes and genetic
variants in order to unravel functional interactions. Unfortunately, most
existing approaches to joint modelling are either too simplistic to be powerful
or are impracticable for computational reasons. Inspired by Richardson et al.
(2010, Bayesian Statistics 9), we consider a sparse multivariate regression
model that allows simultaneous selection of predictors and associated
responses. As Markov chain Monte Carlo (MCMC) inference on such models can be
prohibitively slow when the number of genetic variants exceeds a few thousand,
we propose a variational inference approach which produces posterior
information very close to that of MCMC inference, at a much reduced
computational cost. Extensive numerical experiments show that our approach
outperforms popular variable selection methods and tailored Bayesian
procedures, dealing within hours with problems involving hundreds of thousands
of genetic variants and tens to hundreds of clinical or molecular outcomes.
Karl Pearson's meta-analysis revisited
This paper revisits a meta-analysis method proposed by Pearson [Biometrika 26
(1934) 425--442] and first used by David [Biometrika 26 (1934) 1--11]. It was
thought to be inadmissible for over fifty years, dating back to a paper of
Birnbaum [J. Amer. Statist. Assoc. 49 (1954) 559--574]. It turns out that the
method Birnbaum analyzed is not the one that Pearson proposed. We show that
Pearson's proposal is admissible. Because it is admissible, it has better power
than the standard test of Fisher [Statistical Methods for Research Workers
(1932) Oliver and Boyd] at some alternatives, and worse power at others.
Pearson's method has the advantage when all or most of the nonzero parameters
share the same sign. Pearson's test has proved useful in a genomic setting,
screening for age-related genes. This paper also presents an FFT-based method
for getting hard upper and lower bounds on the CDF of a sum of nonnegative
random variables. Comment: Published at http://dx.doi.org/10.1214/09-AOS697 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org
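
As a rough illustration of the comparison with Fisher's test, the sketch below combines one-sided p-values in both directions and takes the larger of the two Fisher-type statistics, which is our reading of Pearson's proposal; the abstract itself does not spell out the formula. The function names are hypothetical, the p-values are assumed independent and strictly between 0 and 1, and the returned p-value is a conservative union bound rather than the calibration used in the paper.

```python
import math


def chi2_sf_even_df(x, m):
    """P(chi-square with 2*m degrees of freedom >= x), exact closed form."""
    half = x / 2.0
    term = 1.0
    total = 1.0
    for k in range(1, m):
        term *= half / k  # term is now (x/2)^k / k!
        total += term
    return math.exp(-half) * total


def fisher_statistic(pvalues):
    """Fisher's combined statistic, -2 * sum(log p)."""
    return -2.0 * sum(math.log(p) for p in pvalues)


def pearson_combined(left_pvalues):
    """Max of the left- and right-sided Fisher statistics with a p-value bound.

    `left_pvalues` are one-sided p-values against the left alternative;
    the right-sided p-values are taken as 1 - p (assumes continuous tests).
    """
    m = len(left_pvalues)
    q_left = fisher_statistic(left_pvalues)
    q_right = fisher_statistic([1.0 - p for p in left_pvalues])
    q_max = max(q_left, q_right)
    # Union bound over the two directions; each statistic is chi-square(2m)
    # under the null hypothesis.
    p_bound = min(1.0, 2.0 * chi2_sf_even_df(q_max, m))
    return q_max, p_bound
```

When the effects share a sign, one of the two directional statistics becomes large, which is consistent with the abstract's remark that the method does well when most nonzero parameters have the same sign.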
A two-phase approach for detecting recombination in nucleotide sequences
Genetic recombination can produce heterogeneous phylogenetic histories within
a set of homologous genes. Delineating recombination events is important in the
study of molecular evolution, as inference of such events provides a clearer
picture of the phylogenetic relationships among different gene sequences or
genomes. Nevertheless, detecting recombination events can be a daunting task,
as the performance of different recombination-detecting approaches can vary,
depending on evolutionary events that take place after recombination. We
recently evaluated the effects of postrecombination events on the prediction
accuracy of recombination-detecting approaches using simulated nucleotide
sequence data. The main conclusion, supported by other studies, is that one
should not depend on a single method when searching for recombination events.
In this paper, we introduce a two-phase strategy, applying three statistical
measures to detect the occurrence of recombination events, and a Bayesian
phylogenetic approach in delineating breakpoints of such events in nucleotide
sequences. We evaluate the performance of these approaches using simulated
data, and demonstrate the applicability of this strategy to empirical data. The
two-phase strategy proves to be time-efficient when applied to large datasets,
and yields high-confidence results. Comment: 5 pages, 3 figures. Chan CX, Beiko RG and Ragan MA (2007). A
two-phase approach for detecting recombination in nucleotide sequences. In
Hazelhurst S and Ramsay M (Eds) Proceedings of the First Southern African
Bioinformatics Workshop, 28-30 January, Johannesburg, 9-1
Simulation-Based Hypothesis Testing of High Dimensional Means Under Covariance Heterogeneity
In this paper, we study the problem of testing the mean vectors of high
dimensional data in both one-sample and two-sample cases. The proposed testing
procedures employ maximum-type statistics and parametric bootstrap
techniques to compute the critical values. Unlike existing tests that
rely heavily on structural conditions on the unknown covariance
matrices, the proposed tests allow general covariance structures of the data
and therefore enjoy a wide scope of applicability in practice. To enhance the
power of the tests against sparse alternatives, we further propose two-step
procedures with a preliminary feature screening step. Theoretical properties of
the proposed tests are investigated. Through extensive numerical experiments on
synthetic datasets and a human acute lymphoblastic leukemia gene expression
dataset, we illustrate the performance of the new tests and how they may
assist in detecting disease-associated gene sets. The proposed
methods have been implemented in an R package, HDtest, and are available on CRAN. Comment: 34 pages, 10 figures; Accepted for biometric
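
A minimal one-sample sketch of the general idea follows: a maximum-type statistic whose null distribution is approximated by Gaussian draws with the sample covariance. It is written in Python for illustration and does not reproduce the HDtest package or the paper's exact statistic. The function name `max_type_test` and its parameters are assumptions, and the statistic is left unstudentized for brevity.

```python
import math
import random


def max_type_test(samples, n_boot=500, seed=0):
    """Test H0: mean vector = 0 with the statistic max_j sqrt(n) * |mean_j|.

    `samples` is a list of n rows, each a list of p values. Each bootstrap
    draw is a Gaussian-weighted sum of the centered rows; conditional on the
    data it is N(0, S) with S the (1/n) sample covariance, so S never has to
    be formed or factorized.
    """
    rng = random.Random(seed)
    n = len(samples)
    p = len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in samples]
    stat = math.sqrt(n) * max(abs(m) for m in means)

    exceed = 0
    for _ in range(n_boot):
        weights = [rng.gauss(0.0, 1.0) for _ in range(n)]
        draw = [
            sum(weights[i] * centered[i][j] for i in range(n)) / math.sqrt(n)
            for j in range(p)
        ]
        if max(abs(v) for v in draw) >= stat:
            exceed += 1
    # The add-one correction keeps the estimated p-value strictly positive.
    p_value = (exceed + 1) / (n_boot + 1)
    return stat, p_value
```

No structural assumption on the covariance is used in the sketch, since dependence between coordinates is carried into the bootstrap draws by the centered data themselves.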