5 research outputs found

    Bayesian Inference for Genomic Data Integration Reduces Misclassification Rate in Predicting Protein-Protein Interactions

    Protein-protein interactions (PPIs) are essential to most fundamental cellular processes. There has been increasing interest in reconstructing PPI networks. However, several critical difficulties exist in obtaining reliable predictions. Notably, false positive rates can exceed 80%. Correcting errors at each generating source can be both time-consuming and inefficient, because a single test can rarely cover the errors introduced at multiple levels of data processing. We propose a novel Bayesian integration method, termed nonparametric Bayes ensemble learning (NBEL), to lower the misclassification rate (both false positives and false negatives) by automatically up-weighting the most informative data sources while down-weighting less informative and biased sources. Extensive studies indicate that NBEL is significantly more robust than classic naïve Bayes to unreliable, error-prone and contaminated data. On a large human data set, our NBEL approach predicts many more PPIs than naïve Bayes. This suggests that previous studies may contain large numbers of not only false positives but also false negatives. Validation on two high-quality human PPI datasets supports our observations. Our experiments demonstrate that it is feasible to predict high-throughput PPIs computationally with substantially reduced false positives and false negatives. The ability to predict large numbers of PPIs both reliably and automatically may inspire people to use computational approaches to correct data errors in general, and may speed up high-quality PPI prediction. Such reliable prediction may provide a solid platform for other studies, such as protein function prediction and the roles of PPIs in disease susceptibility.

    The STRING database in 2021: customizable protein-protein networks, and functional characterization of user-uploaded gene/measurement sets

    Cellular life depends on a complex web of functional associations between biomolecules. Among these associations, protein-protein interactions are particularly important due to their versatility, specificity and adaptability. The STRING database aims to integrate all known and predicted associations between proteins, including both physical interactions as well as functional associations. To achieve this, STRING collects and scores evidence from a number of sources: (i) automated text mining of the scientific literature, (ii) databases of interaction experiments and annotated complexes/pathways, (iii) computational interaction predictions from co-expression and from conserved genomic context and (iv) systematic transfers of interaction evidence from one organism to another. STRING aims for wide coverage; the upcoming version 11.5 of the resource will contain more than 14 000 organisms. In this update paper, we describe changes to the text-mining system, a new scoring-mode for physical interactions, as well as extensive user interface features for customizing, extending and sharing protein networks. In addition, we describe how to query STRING with genome-wide, experimental data, including the automated detection of enriched functionalities and potential biases in the user's query data. The STRING resource is available online, at https://string-db.org/

    Systems biology approaches to a rational drug discovery paradigm

    The published manuscript is available at EurekaSelect via http://www.eurekaselect.com/openurl/content.php?genre=article&doi=10.2174/1568026615666150826114524. Prathipati P., Mizuguchi K. Systems biology approaches to a rational drug discovery paradigm. Current Topics in Medicinal Chemistry, 16, 9, 1009. https://doi.org/10.2174/1568026615666150826114524

    Integrative Analysis Methods for Biological Problems Using Data Reduction Approaches

    The "big data" revolution of the past decade has allowed researchers to procure or access biological data at an unprecedented scale, in terms of both volume (low-cost high-throughput technologies) and variety (multi-platform genomic profiling). This has fueled the development of new integrative methods, which combine and consolidate multiple sources of data in order to gain generalizability, robustness, and a more comprehensive systems perspective. The key challenges faced by this new class of methods primarily relate to heterogeneity, whether across cohorts from independent studies or across the different levels of genomic regulation. While the different perspectives among data sources are invaluable in providing different snapshots of the global system, such diversity also brings many analytic difficulties, as each source introduces a distinctive element of noise. In recent years, many styles of data integration have appeared to tackle this problem, ranging from Bayesian frameworks to graphical models, a wide assortment as diverse as the biology they intend to explain. My focus in this work is dimensionality reduction-based methods of integration, which offer the advantages of efficiency in high dimensions (an asset among genomic datasets) and simplicity in allowing for elegant mathematical extensions. In the course of these chapters I will describe the biological motivations, the methodological directions, and the applications of three canonical reductionist approaches for relating information across multiple data groups.
    PhD, Statistics, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/138564/1/yangzi_1.pd