A framework for list representation, enabling list stabilization through incorporation of gene exchangeabilities
Analysis of multivariate data sets from e.g. microarray studies frequently
results in lists of genes which are associated with some response of interest.
The biological interpretation is often complicated by the statistical
instability of the obtained gene lists with respect to sampling variations,
which may partly be due to the functional redundancy among genes, implying that
multiple genes can play exchangeable roles in the cell. In this paper we use
the concept of exchangeability of random variables to model this functional
redundancy and thereby account for the instability attributable to sampling
variations. We present a flexible framework to incorporate the exchangeability
into the representation of lists. The proposed framework supports
straightforward robust comparison between any two lists. It can also be used to
generate new, more stable gene rankings incorporating more information from the
experimental data. Using a microarray data set from lung cancer patients we
show that the proposed method provides more robust gene rankings than existing
methods with respect to sampling variations, without compromising the
biological significance.
Using the bootstrap to quantify the authority of an empirical ranking
The bootstrap is a popular and convenient method for quantifying the
authority of an empirical ordering of attributes, for example of a ranking of
the performance of institutions or of the influence of genes on a response
variable. In the first of these examples, the number of quantities being
ordered is sometimes only moderate in size; in the second it can be very large,
often much greater than sample size. However, we show that in both types of
problem the conventional bootstrap can produce inconsistency. Moreover, the
standard m-out-of-n bootstrap estimator of the distribution of an empirical
rank may not converge in the usual sense; the estimator may converge in
distribution, but not in probability. Nevertheless, in many cases the bootstrap
correctly identifies the support of the asymptotic distribution of ranks. In
some contemporary problems, bootstrap prediction intervals for ranks are
particularly long, and in this context, we also quantify the accuracy of
bootstrap methods, showing that the standard bootstrap gets the order of
magnitude of the interval right, but not the constant multiplier of interval
length. The m-out-of-n bootstrap can improve performance and produce
statistical consistency, but it requires empirical choice of m; we suggest a
tuning solution to this problem. We show that in genomic examples, where it
might be expected that the standard, ``synchronous'' bootstrap will
successfully accommodate nonindependence of vector components, that approach
can produce misleading results. An ``independent component'' bootstrap can
overcome these difficulties, even in cases where components are not strictly
independent. Comment: Published at http://dx.doi.org/10.1214/09-AOS699 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org).
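A minimal sketch of how a bootstrap distribution of an empirical rank might be computed, with an optional m-out-of-n resample size. The function names (`rank_of`, `bootstrap_ranks`, `make_data`), the use of column means as the ranking statistic, and the toy Gaussian data are assumptions made for illustration; this is not the authors' procedure and it does not include their tuning rule for m. Rows are resampled whole, which corresponds to the "synchronous" bootstrap mentioned in the abstract.

```python
import random
from statistics import mean


def rank_of(values, j):
    """Rank of item j among values (1 = largest); ties broken by index."""
    return 1 + sum(
        1
        for k, v in enumerate(values)
        if v > values[j] or (v == values[j] and k < j)
    )


def bootstrap_ranks(data, j, m=None, n_boot=500, seed=0):
    """Bootstrap the rank of column j.

    data   : list of n rows, each a list of p scores
    m      : resample size; m = n gives the standard bootstrap,
             m < n gives an m-out-of-n bootstrap
    Returns a list of n_boot resampled ranks for column j.
    """
    rng = random.Random(seed)
    n, p = len(data), len(data[0])
    m = n if m is None else m
    ranks = []
    for _ in range(n_boot):
        # Resample whole rows with replacement (synchronous resampling).
        sample = [data[rng.randrange(n)] for _ in range(m)]
        col_means = [mean(row[k] for row in sample) for k in range(p)]
        ranks.append(rank_of(col_means, j))
    return ranks


def make_data(n=40, p=5, seed=1):
    """Toy data (an assumption): column 0 has mean 10, the rest mean 0."""
    rng = random.Random(seed)
    return [
        [rng.gauss(10.0 if k == 0 else 0.0, 1.0) for k in range(p)]
        for _ in range(n)
    ]


if __name__ == "__main__":
    data = make_data()
    ranks = bootstrap_ranks(data, j=1, m=10, n_boot=200)
    # Empirical support of the bootstrap rank distribution for column 1.
    print(sorted(set(ranks)))
```

The spread of the returned ranks gives a rough picture of how uncertain an empirical rank is; comparing the output for m = n with a smaller m shows how the resample size changes that picture.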
Modeling the variability of rankings
For better or for worse, rankings of institutions, such as universities,
schools and hospitals, play an important role today in conveying information
about relative performance. They inform policy decisions and budgets, and are
often reported in the media. While overall rankings can vary markedly over
relatively short time periods, it is not unusual to find that the ranks of a
small number of "highly performing" institutions remain fixed, even when the
data on which the rankings are based are extensively revised, and even when a
large number of new institutions are added to the competition. In the present
paper, we endeavor to model this phenomenon. In particular, we interpret as a
random variable the value of the attribute on which the ranking should ideally
be based. More precisely, if p items are to be ranked then the true, but
unobserved, attributes are taken to be values of p independent and
identically distributed variates. However, each attribute value is observed
only with noise, and via a sample of size roughly equal to n, say. These
noisy approximations to the true attributes are the quantities that are
actually ranked. We show that, if the distribution of the true attributes is
light-tailed (e.g., normal or exponential) then the number of institutions
whose ranking is correct, even after recalculation using new data and even
after many new institutions are added, is essentially fixed. Formally, p is
taken to be of order n^C for any fixed C > 0, and the number of institutions
whose ranking is reliable depends very little on p. Comment: Published at http://dx.doi.org/10.1214/10-AOS794 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org).
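The setting described in this abstract can be explored with a small simulation. The sketch below is an illustration under stated assumptions, not the paper's analysis: true attributes are drawn from a standard exponential distribution, each is observed as the average of n measurements with Gaussian noise (so the averaged noise has standard deviation noise_sd / sqrt(n)), and the function counts how many leading positions of the noisy ranking agree with the true ranking. The name `leading_correct_ranks` and all parameter names are hypothetical.

```python
import random


def leading_correct_ranks(p, n, noise_sd=1.0, seed=0):
    """Count how many top positions of a noisy ranking match the truth.

    p        : number of items ranked
    n        : number of noisy measurements averaged per item
    noise_sd : standard deviation of a single measurement's noise
    """
    rng = random.Random(seed)
    # True, unobserved attributes: i.i.d. light-tailed (exponential) draws.
    truth = [rng.expovariate(1.0) for _ in range(p)]
    # Observed attribute = truth + averaged noise with sd noise_sd / sqrt(n).
    observed = [t + rng.gauss(0.0, noise_sd / n ** 0.5) for t in truth]

    true_order = sorted(range(p), key=lambda i: truth[i], reverse=True)
    noisy_order = sorted(range(p), key=lambda i: observed[i], reverse=True)

    count = 0
    for a, b in zip(true_order, noisy_order):
        if a != b:
            break
        count += 1
    return count


if __name__ == "__main__":
    # Compare the number of correctly ranked leaders as p grows with n fixed.
    for p in (100, 1000, 10000):
        print(p, leading_correct_ranks(p, n=100, seed=3))
```

Running this for several seeds and values of p gives an informal feel for how many leading ranks survive the noise; any pattern seen in a single run should be treated as anecdotal.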
Uncovering protein interaction in abstracts and text using a novel linear model and word proximity networks
We participated in three of the protein-protein interaction subtasks of the
Second BioCreative Challenge: classification of abstracts relevant for
protein-protein interaction (IAS), discovery of protein pairs (IPS) and text
passages characterizing protein interaction (ISS) in full text documents. We
approached the abstract classification task with a novel, lightweight linear
model inspired by spam-detection techniques, as well as an uncertainty-based
integration scheme. We also used a Support Vector Machine and the Singular
Value Decomposition on the same features for comparison purposes. Our approach
to the full text subtasks (protein pair and passage identification) includes a
feature expansion method based on word-proximity networks. Our approach to the
abstract classification task (IAS) was among the top submissions for this task
in terms of the measures of performance used in the challenge evaluation
(accuracy, F-score and AUC). We also report on a web-tool we produced using our
approach: the Protein Interaction Abstract Relevance Evaluator (PIARE). Our
approach to the full text tasks resulted in one of the highest recall rates as
well as mean reciprocal rank of correct passages. Our approach to abstract
classification shows that a simple linear model, using relatively few features,
is capable of generalizing and uncovering the conceptual nature of
protein-protein interaction from the bibliome. Since the novel approach is
based on a very lightweight linear model, it can be easily ported and applied
to similar problems. In full text problems, the expansion of word features with
word-proximity networks is shown to be useful, though the need for some
improvements is discussed.
The use of vector bootstrapping to improve variable selection precision in Lasso models
The Lasso is a shrinkage regression method that is widely used for variable selection in statistical genetics. Commonly, K-fold cross-validation is used to fit a Lasso model. This is sometimes followed by using bootstrap confidence intervals to improve precision in the resulting variable selections. Nesting cross-validation within bootstrapping could provide further improvements in precision, but this has not been investigated systematically. We performed simulation studies of Lasso variable selection precision (VSP) with and without nesting cross-validation within bootstrapping. Data were simulated to represent genomic data under a polygenic model as well as under a model with effect sizes representative of typical GWAS results. We compared these approaches to each other as well as to software defaults for the Lasso. Nested cross-validation had the most precise variable selection at small effect sizes. At larger effect sizes, there was no advantage to nesting. We illustrated the nested approach with empirical data comprising SNPs and SNP-SNP interactions from the most significant SNPs in a GWAS of borderline personality symptoms. In the empirical example, we found that the default Lasso selected low-reliability SNPs and interactions which were excluded by bootstrapping.
Stability of gene rankings from RNAi screens
Motivation: Genome-wide RNA interference (RNAi) experiments are becoming a widely used approach for identifying intracellular molecular pathways of specific functions. However, detecting all relevant genes involved in a biological process is challenging, because typically only few samples per gene knock-down are available and readouts tend to be very noisy. We investigate the reliability of top scoring hit lists obtained from RNAi screens, compare the performance of different ranking methods, and propose a new ranking method to improve the reproducibility of gene selection. Results: The performance of different ranking methods is assessed by the size of the stable sets they produce, i.e. the subsets of genes which are estimated to be re-selected with high probability in independent validation experiments. Using stability selection, we also define a new ranking method, called stability ranking, to improve the stability of any given base ranking method. Ranking methods based on mean, median, t-test and rank-sum test, and their stability-augmented counterparts are compared in simulation studies and on three microscopy image RNAi datasets. We find that the rank-sum test offers the most favorable trade-off between ranking stability and accuracy and that stability ranking improves the reproducibility of all and the accuracy of several ranking methods. Availability: Stability ranking is freely available as the R/Bioconductor package staRank at http://www.cbg.ethz.ch/software/staRank. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
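The idea of a stable set, i.e. genes that are re-selected with high probability, can be illustrated with a generic subsampling sketch. This is not the staRank implementation or its API: the base ranking (mean readout), the half-sample scheme, the 0.9 threshold, and the names `selection_frequencies` and `stable_set` are all assumptions chosen for illustration.

```python
import random
from statistics import mean


def top_k(scores, k):
    """Return the set of the k genes with the highest score."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


def selection_frequencies(replicates, k, n_sub=200, seed=0):
    """Estimate how often each gene lands in the top k under subsampling.

    replicates : dict mapping gene name -> list of replicate readouts
    k          : size of the hit list produced by the base ranking
    n_sub      : number of random subsamples
    """
    rng = random.Random(seed)
    counts = {gene: 0 for gene in replicates}
    for _ in range(n_sub):
        scores = {}
        for gene, values in replicates.items():
            # Draw half of the replicates without replacement.
            half = max(1, len(values) // 2)
            scores[gene] = mean(rng.sample(values, half))
        for gene in top_k(scores, k):
            counts[gene] += 1
    return {gene: c / n_sub for gene, c in counts.items()}


def stable_set(freqs, threshold=0.9):
    """Genes whose estimated re-selection frequency reaches the threshold."""
    return {gene for gene, f in freqs.items() if f >= threshold}


if __name__ == "__main__":
    # Hypothetical readouts: geneA is a clear hit, geneB and geneC are close.
    data = {
        "geneA": [5.0, 5.1, 4.9, 5.2],
        "geneB": [1.0, 1.4, 0.7, 1.2],
        "geneC": [1.1, 0.8, 1.3, 0.9],
    }
    freqs = selection_frequencies(data, k=2)
    print(freqs)
    print(stable_set(freqs))
```

Sorting genes by their selection frequency, rather than by the base score, is one simple way to turn such frequencies into a stability-oriented ranking; the published method may differ in its details.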