1,920 research outputs found

    A framework for list representation, enabling list stabilization through incorporation of gene exchangeabilities

    Analysis of multivariate data sets from, e.g., microarray studies frequently results in lists of genes which are associated with some response of interest. The biological interpretation is often complicated by the statistical instability of the obtained gene lists with respect to sampling variations, which may partly be due to the functional redundancy among genes, implying that multiple genes can play exchangeable roles in the cell. In this paper we use the concept of exchangeability of random variables to model this functional redundancy and thereby account for the instability attributable to sampling variations. We present a flexible framework to incorporate the exchangeability into the representation of lists. The proposed framework supports straightforward, robust comparison between any two lists. It can also be used to generate new, more stable gene rankings incorporating more information from the experimental data. Using a microarray data set from lung cancer patients, we show that the proposed method provides more robust gene rankings than existing methods with respect to sampling variations, without compromising the biological significance.

    Using the bootstrap to quantify the authority of an empirical ranking

    The bootstrap is a popular and convenient method for quantifying the authority of an empirical ordering of attributes, for example of a ranking of the performance of institutions or of the influence of genes on a response variable. In the first of these examples, the number, p, of quantities being ordered is sometimes only moderate in size; in the second it can be very large, often much greater than sample size. However, we show that in both types of problem the conventional bootstrap can produce inconsistency. Moreover, the standard n-out-of-n bootstrap estimator of the distribution of an empirical rank may not converge in the usual sense; the estimator may converge in distribution, but not in probability. Nevertheless, in many cases the bootstrap correctly identifies the support of the asymptotic distribution of ranks. In some contemporary problems, bootstrap prediction intervals for ranks are particularly long, and in this context, we also quantify the accuracy of bootstrap methods, showing that the standard bootstrap gets the order of magnitude of the interval right, but not the constant multiplier of interval length. The m-out-of-n bootstrap can improve performance and produce statistical consistency, but it requires empirical choice of m; we suggest a tuning solution to this problem. We show that in genomic examples, where it might be expected that the standard, "synchronous" bootstrap will successfully accommodate nonindependence of vector components, that approach can produce misleading results. An "independent component" bootstrap can overcome these difficulties, even in cases where components are not strictly independent. Comment: Published in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org) at http://dx.doi.org/10.1214/09-AOS699
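    As a rough illustration of the m-out-of-n idea mentioned in the abstract above, the following sketch resamples m of the n observation rows with replacement, re-ranks the p items by their resampled means, and tallies how often each item lands at each rank. It is only a minimal sketch under stated assumptions: the function names, the use of column means as the ranking statistic, the tie-breaking rule, and the fixed choice of m are all illustrative and are not taken from the paper, which also proposes a tuning rule for m that is not reproduced here. Rows are resampled whole, which corresponds to what the abstract calls the "synchronous" bootstrap.

```python
import random
from collections import Counter


def empirical_ranks(means):
    """Return ranks where rank 1 is the largest mean (ties broken by index)."""
    order = sorted(range(len(means)), key=lambda j: (-means[j], j))
    ranks = [0] * len(means)
    for r, j in enumerate(order, start=1):
        ranks[j] = r
    return ranks


def m_out_of_n_rank_bootstrap(data, m, n_boot=500, seed=0):
    """Tally bootstrap rank frequencies for each of the p items.

    data   : list of n rows, each a list of p values (one value per item)
    m      : resample size (m == n gives the standard n-out-of-n bootstrap)
    n_boot : number of bootstrap replicates (illustrative default)
    Returns a list of p Counters mapping rank -> number of replicates.
    """
    rng = random.Random(seed)
    n, p = len(data), len(data[0])
    counts = [Counter() for _ in range(p)]
    for _ in range(n_boot):
        # Draw m whole rows with replacement (the "synchronous" scheme).
        rows = [data[rng.randrange(n)] for _ in range(m)]
        means = [sum(row[j] for row in rows) / m for j in range(p)]
        for j, r in enumerate(empirical_ranks(means)):
            counts[j][r] += 1
    return counts
```

    Dividing each tally by n_boot gives an estimated rank distribution per item; comparing the results for m = n and for a smaller m is one simple way to see how sensitive the rank distribution is to the resample size.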

    Modeling the variability of rankings

    For better or for worse, rankings of institutions, such as universities, schools and hospitals, play an important role today in conveying information about relative performance. They inform policy decisions and budgets, and are often reported in the media. While overall rankings can vary markedly over relatively short time periods, it is not unusual to find that the ranks of a small number of "highly performing" institutions remain fixed, even when the data on which the rankings are based are extensively revised, and even when a large number of new institutions are added to the competition. In the present paper, we endeavor to model this phenomenon. In particular, we interpret as a random variable the value of the attribute on which the ranking should ideally be based. More precisely, if p items are to be ranked then the true, but unobserved, attributes are taken to be values of p independent and identically distributed variates. However, each attribute value is observed only with noise, and via a sample of size roughly equal to n, say. These noisy approximations to the true attributes are the quantities that are actually ranked. We show that, if the distribution of the true attributes is light-tailed (e.g., normal or exponential) then the number of institutions whose ranking is correct, even after recalculation using new data and even after many new institutions are added, is essentially fixed. Formally, p is taken to be of order n^C for any fixed C > 0, and the number of institutions whose ranking is reliable depends very little on p. Comment: Published in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org) at http://dx.doi.org/10.1214/10-AOS794
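    The setup described in the abstract above can be explored with a small simulation. The sketch below draws p true attributes from a standard normal distribution, observes each one as the mean of n unit-variance noisy measurements, and counts how many of the leading positions agree between the true and the observed orderings. The function name, the normal noise model, and the "leading run of matching positions" summary are assumptions made for illustration; the paper's formal results are not reproduced here.

```python
import random


def count_correct_top_ranks(p, n, seed=0):
    """Count leading positions where the observed ranking matches the truth.

    p : number of items being ranked
    n : number of noisy measurements averaged per item
    """
    rng = random.Random(seed)
    # True, unobserved attributes: i.i.d. light-tailed (standard normal).
    theta = [rng.gauss(0, 1) for _ in range(p)]
    # Observed value: true attribute plus the error of an n-sample mean,
    # which has standard deviation 1 / sqrt(n) for unit-variance noise.
    observed = [t + rng.gauss(0, 1) / n ** 0.5 for t in theta]
    true_order = sorted(range(p), key=lambda j: -theta[j])
    obs_order = sorted(range(p), key=lambda j: -observed[j])
    k = 0
    while k < p and true_order[k] == obs_order[k]:
        k += 1
    return k
```

    Running this over several seeds for increasing p at fixed n gives an informal feel for how the count of correctly ranked leading items changes as more items are added.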

    Uncovering protein interaction in abstracts and text using a novel linear model and word proximity networks

    We participated in three of the protein-protein interaction subtasks of the Second BioCreative Challenge: classification of abstracts relevant for protein-protein interaction (IAS), discovery of protein pairs (IPS) and text passages characterizing protein interaction (ISS) in full text documents. We approached the abstract classification task with a novel, lightweight linear model inspired by spam-detection techniques, as well as an uncertainty-based integration scheme. We also used a Support Vector Machine and the Singular Value Decomposition on the same features for comparison purposes. Our approach to the full text subtasks (protein pair and passage identification) includes a feature expansion method based on word-proximity networks. Our approach to the abstract classification task (IAS) was among the top submissions for this task in terms of the measures of performance used in the challenge evaluation (accuracy, F-score and AUC). We also report on a web-tool we produced using our approach: the Protein Interaction Abstract Relevance Evaluator (PIARE). Our approach to the full text tasks resulted in one of the highest recall rates as well as mean reciprocal rank of correct passages. Our approach to abstract classification shows that a simple linear model, using relatively few features, is capable of generalizing and uncovering the conceptual nature of protein-protein interaction from the bibliome. Since the novel approach is based on a very lightweight linear model, it can be easily ported and applied to similar problems. In full text problems, the expansion of word features with word-proximity networks is shown to be useful, though the need for some improvements is discussed.

    The use of vector bootstrapping to improve variable selection precision in Lasso models

    The Lasso is a shrinkage regression method that is widely used for variable selection in statistical genetics. Commonly, K-fold cross-validation is used to fit a Lasso model. This is sometimes followed by using bootstrap confidence intervals to improve precision in the resulting variable selections. Nesting cross-validation within bootstrapping could provide further improvements in precision, but this has not been investigated systematically. We performed simulation studies of Lasso variable selection precision (VSP) with and without nesting cross-validation within bootstrapping. Data were simulated to represent genomic data under a polygenic model as well as under a model with effect sizes representative of typical GWAS results. We compared these approaches to each other as well as to software defaults for the Lasso. Nested cross-validation had the most precise variable selection at small effect sizes. At larger effect sizes, there was no advantage to nesting. We illustrated the nested approach with empirical data comprising SNPs and SNP-SNP interactions from the most significant SNPs in a GWAS of borderline personality symptoms. In the empirical example, we found that the default Lasso selected low-reliability SNPs and interactions which were excluded by bootstrapping.

    Stability of gene rankings from RNAi screens

    Motivation: Genome-wide RNA interference (RNAi) experiments are becoming a widely used approach for identifying intracellular molecular pathways of specific functions. However, detecting all relevant genes involved in a biological process is challenging, because typically only a few samples per gene knock-down are available and readouts tend to be very noisy. We investigate the reliability of top scoring hit lists obtained from RNAi screens, compare the performance of different ranking methods, and propose a new ranking method to improve the reproducibility of gene selection. Results: The performance of different ranking methods is assessed by the size of the stable sets they produce, i.e. the subsets of genes which are estimated to be re-selected with high probability in independent validation experiments. Using stability selection, we also define a new ranking method, called stability ranking, to improve the stability of any given base ranking method. Ranking methods based on mean, median, t-test and rank-sum test, and their stability-augmented counterparts are compared in simulation studies and on three microscopy image RNAi datasets. We find that the rank-sum test offers the most favorable trade-off between ranking stability and accuracy and that stability ranking improves the reproducibility of all and the accuracy of several ranking methods. Availability: Stability ranking is freely available as the R/Bioconductor package staRank at http://www.cbg.ethz.ch/software/staRank. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
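    The general idea of estimating how reliably a gene is re-selected can be sketched with a simple subsampling loop. The code below is not the staRank implementation and does not reproduce the paper's stability ranking procedure; it is a minimal sketch that assumes a mean-based base ranking, subsamples of half the replicates per gene, and a fixed top-k cutoff. The function name and all parameters are hypothetical.

```python
import random


def top_k_selection_frequency(readouts, k, n_sub=200, seed=0):
    """Estimate how often each gene lands in the top k under subsampling.

    readouts : dict mapping gene name -> list of replicate readouts
    k        : size of the hit list
    n_sub    : number of subsampling rounds (illustrative default)
    Returns a dict mapping gene name -> selection frequency in [0, 1].
    """
    rng = random.Random(seed)
    hits = {gene: 0 for gene in readouts}
    for _ in range(n_sub):
        sub_means = {}
        for gene, reps in readouts.items():
            # Use half of the replicates (at least one), without replacement.
            size = max(1, len(reps) // 2)
            sub = rng.sample(reps, size)
            sub_means[gene] = sum(sub) / size
        # Base ranking: larger mean readout ranks higher; ties broken by name.
        top = sorted(sub_means, key=lambda g: (-sub_means[g], g))[:k]
        for gene in top:
            hits[gene] += 1
    return {gene: hits[gene] / n_sub for gene in readouts}
```

    Sorting genes by the returned frequency gives a stability-informed ordering, and genes whose frequency exceeds a chosen threshold form an informal analogue of a stable set.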