
    The impact of violating the independence assumption in meta-analysis on biomarker discovery

    With rapid advancements in high-throughput sequencing technologies, massive amounts of “-omics” data are now available in almost every biomedical field. Owing to variation in biological models and analytic methods, findings from clinical and biological studies often fail to generalize when tested in independent cohorts. Meta-analysis, a set of statistical tools for integrating independent studies that address similar research questions, has been proposed to improve the accuracy and robustness of new biological insights. However, it is common practice among biomarker discovery studies using preclinical pharmacogenomic data to borrow molecular profiles of cancer cell lines from one study to another, creating dependence across studies. The impact of violating the independence assumption in meta-analysis is largely unknown. In this study, we review and compare different meta-analysis approaches for estimating variation across studies and discovering biomarkers from preclinical pharmacogenomics data. We further use simulation studies to evaluate the performance of conventional meta-analysis when the dependence of the effects is ignored. Results show that, as the number of non-independent effects increased, relative mean squared error increased and coverage probability decreased. We also assess potential bias in effect estimation for established meta-analysis approaches when data are duplicated and the independence assumption is violated. Using pharmacogenomics biomarker discovery, we find that treating dependent studies as independent can substantially increase the bias of meta-analyses. Importantly, we show that violating the independence assumption decreases the generalizability of the biomarker discovery process and increases false positive results, a key challenge in precision oncology.
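
    A minimal simulation sketch of the coverage problem described above (our illustration, not the study's code): when a subset of effect estimates is duplicated but the pooled standard error still assumes independence, nominal 95% confidence intervals undercover.

        import numpy as np

        rng = np.random.default_rng(0)
        true_effect, se, n_studies, n_dup, reps = 0.3, 0.1, 10, 5, 2000
        covered = 0
        for _ in range(reps):
            effects = rng.normal(true_effect, se, n_studies)
            # Reusing cell-line profiles across studies duplicates effects
            effects = np.concatenate([effects, effects[:n_dup]])
            pooled = effects.mean()
            # Standard error under the (violated) independence assumption
            naive_se = se / np.sqrt(len(effects))
            covered += abs(pooled - true_effect) < 1.96 * naive_se
        # Prints roughly 0.87, below the nominal 0.95
        print(f"coverage of the nominal 95% CI: {covered / reps:.2f}")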

    Study of meta-analysis strategies for network inference using information-theoretic approaches

    Reverse engineering of gene regulatory networks (GRNs) from gene expression data is a classical challenge in systems biology. Thanks to high-throughput technologies, a massive amount of gene expression data has accumulated in public repositories. Modelling GRNs from multiple experiments (also called integrative analysis) has therefore become a standard procedure in modern computational biology. Such analysis is usually more robust than traditional approaches focused on individual datasets, which typically suffer from experimental bias and small sample sizes. To date, there are mainly two strategies for this problem: the first (“data merging”) merges all datasets together and then infers a GRN, whereas the other (“networks ensemble”) infers a GRN from every dataset separately and then aggregates them using ensemble rules (such as ranksum or weightsum). Unfortunately, a thorough comparison of these two approaches is lacking. In this paper, we evaluate the performance of the meta-analysis approaches mentioned above with a systematic set of experiments based on in silico benchmarks. Furthermore, we present a new meta-analysis approach for inferring GRNs from multiple studies. Our proposed approach, adapted to methods based on pairwise measures such as correlation or mutual information, consists of two steps: aggregating the matrices of pairwise measures from every dataset, then extracting the network from the meta-matrix.
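
    A minimal sketch of the two-step approach described above, assuming absolute Pearson correlation as the pairwise measure and a simple average as the aggregation rule (both are illustrative choices, not necessarily the paper's exact settings):

        import numpy as np

        def pairwise_matrix(expr):
            # Absolute Pearson correlation between genes; expr is samples x genes
            return np.abs(np.corrcoef(expr, rowvar=False))

        def meta_network(datasets, top_k):
            # Step 1: aggregate per-dataset pairwise matrices into a meta-matrix
            meta = np.mean([pairwise_matrix(d) for d in datasets], axis=0)
            np.fill_diagonal(meta, 0.0)
            # Step 2: extract the network by keeping the top-k scoring gene pairs
            i, j = np.triu_indices_from(meta, k=1)
            order = np.argsort(meta[i, j])[::-1][:top_k]
            return [(a, b, meta[a, b]) for a, b in zip(i[order], j[order])]

        rng = np.random.default_rng(1)
        studies = [rng.normal(size=(30, 50)) for _ in range(3)]  # three toy datasets
        edges = meta_network(studies, top_k=10)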

    Relevance of different prior knowledge sources for inferring gene interaction networks

    When inferring networks from high-throughput genomic data, one of the main challenges is the subsequent validation of these networks. In the best-case scenario, the true network is partially known from previous research results published in structured databases or research articles. Traditionally, inferred networks are validated against these known interactions. Whenever the recovery rate is gauged to be high enough, high-scoring but unknown inferred interactions are deemed good candidates for further experimental validation. Such a validation framework therefore depends strongly on the quantity and quality of published interactions and presents serious pitfalls: (1) known interactions for the studied problem might be sparse; (2) quantitatively comparing different inference algorithms is not trivial; and (3) using these known interactions for validation prevents their integration into the inference procedure. The latter is particularly relevant, as it has recently been shown that integrating priors during network inference significantly improves the quality of inferred networks. To overcome these problems when validating inferred networks, we recently proposed a data-driven validation framework based on single-gene knock-down experiments. Using this framework, we were able to demonstrate the benefits of integrating prior knowledge and expression data. In this paper we use this framework to assess the quality of different sources of prior knowledge, on their own and in combination with different genomic data sets, in colorectal cancer. We observe that most prior sources lead to significant F-scores. Furthermore, their integration with genomic data leads to a significant increase in F-scores, especially for priors extracted from full-text PubMed articles, known co-expression modules, and genetic interactions. Lastly, we observe that the results are consistent across three different data sets: experimental knock-down data and two human tumor data sets.
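
    As a concrete illustration of the edge-level F-score used for this kind of validation, a small hedged sketch (the gene pairs are hypothetical examples, not results from the paper):

        def f_score(inferred, known):
            # Harmonic mean of precision and recall over edge sets
            tp = len(inferred & known)
            if tp == 0:
                return 0.0
            precision, recall = tp / len(inferred), tp / len(known)
            return 2 * precision * recall / (precision + recall)

        inferred = {("TP53", "MDM2"), ("KRAS", "BRAF"), ("APC", "CTNNB1")}
        known = {("TP53", "MDM2"), ("APC", "CTNNB1"), ("SMAD4", "TGFBR2")}
        print(round(f_score(inferred, known), 3))  # 0.667: 2 of 3 inferred edges are known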

    On the distribution of cosine similarity with application to biology

    Cosine similarity is an established similarity metric for computing associations between vectors, and it is commonly used to identify related samples in biological perturbational data. The distribution of cosine similarity changes with the covariance of the data, and this in turn affects the statistical power to identify related signals. The relationship between the mean and covariance of the data distribution and the distribution of cosine similarity is poorly understood. In this work, we derive the asymptotic moments of cosine similarity as a function of the data and identify the criteria of the data covariance matrix that minimize the variance of cosine similarity. We find that the variance of cosine similarity is minimized when the eigenvalues of the covariance matrix are equal for centered data. One immediate application of this work is characterizing the null distribution of cosine similarity over a dataset with non-zero covariance structure. Furthermore, this result can be used to optimize over a set of transformations or representations of a dataset to maximize power, recall, or other discriminative metrics, with direct application to noisy biological data. While we consider the specific biological domain of perturbational data analysis, our result has potential application to any use of cosine similarity or Pearson's correlation on data with covariance structure.
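
    A Monte Carlo sketch (our illustration, not the paper's derivation) of the stated result: for centered Gaussian vectors, the variance of cosine similarity is smallest when the covariance eigenvalues are equal, holding the trace fixed.

        import numpy as np

        def cosine_sim_var(eigvals, n=100_000, seed=2):
            # Sample n independent vector pairs with covariance diag(eigvals)
            rng = np.random.default_rng(seed)
            x = rng.normal(size=(n, len(eigvals))) * np.sqrt(eigvals)
            y = rng.normal(size=(n, len(eigvals))) * np.sqrt(eigvals)
            cos = (x * y).sum(axis=1) / (
                np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1))
            return cos.var()

        print(cosine_sim_var(np.ones(10)))                # equal eigenvalues: ~1/10
        print(cosine_sim_var(np.linspace(0.1, 1.9, 10)))  # same trace, higher variance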

    A fuzzy gene expression-based computational approach improves breast cancer prognostication

    A fuzzy computational approach that takes into account several molecular subtypes in order to provide more accurate breast cancer prognosis.