1,674 research outputs found

    Analysis and Computational Dissection of Molecular Signature Multiplicity

    Molecular signatures are computational or mathematical models created to diagnose disease and other phenotypes and to predict clinical outcomes and response to treatment. It is widely recognized that molecular signatures constitute one of the most important translational and basic science developments enabled by recent high-throughput molecular assays. A perplexing phenomenon that characterizes high-throughput data analysis is the ubiquitous multiplicity of molecular signatures. Multiplicity is a special form of data analysis instability in which different analysis methods used on the same data, or different samples from the same population, lead to different but apparently maximally predictive signatures. This phenomenon has far-reaching implications for biological discovery and development of next-generation patient diagnostics and personalized treatments. Currently the causes and interpretation of signature multiplicity are unknown, and several, often contradictory, conjectures have been made to explain it. We present a formal characterization of signature multiplicity and a new efficient algorithm that offers theoretical guarantees for extracting the set of maximally predictive and non-redundant signatures independent of distribution. The new algorithm identifies exactly the set of optimal signatures in controlled experiments and yields signatures with significantly better predictivity and reproducibility than previous algorithms in human microarray gene expression datasets. Our results shed light on the causes of signature multiplicity, provide computational tools for studying it empirically, and introduce a framework for in silico bioequivalence of this important new class of diagnostic and personalized medicine modalities.

    Exploring signature multiplicity in microarray data using ensembles of randomized trees

    A challenging and novel direction for feature selection research in computational biology is the analysis of signature multiplicity. In this work, we propose to investigate the effect of signature multiplicity on feature importance scores derived from tree-based ensemble methods. We show that looking at individual tree rankings in an ensemble could highlight the existence of multiple signatures, and we propose a simple post-processing method based on clustering that can return smaller signatures with better predictive performance than signatures derived from the global tree ranking, at almost no additional cost.

    Expanding the Understanding of Biases in Development of Clinical-Grade Molecular Signatures: A Case Study in Acute Respiratory Viral Infections

    The promise of modern personalized medicine is to use molecular and clinical information to better diagnose, manage, and treat disease on an individual patient basis. These functions are predominantly enabled by molecular signatures, which are computational models for predicting phenotypes and other responses of interest from high-throughput assay data. Data analytics is a central component of molecular signature development and can jeopardize the entire process if conducted incorrectly. While exploratory data analysis may tolerate suboptimal protocols, clinical-grade molecular signatures are subject to vastly stricter requirements. Closing the gap between standards for exploratory versus clinically successful molecular signatures entails a thorough understanding of possible biases in the data analysis phase and developing strategies to avoid them. Using a recently introduced data-analytic protocol as a case study, we provide an in-depth examination of the poorly studied biases of the data-analytic protocols related to signature multiplicity, biomarker redundancy, data preprocessing, and validation of signature reproducibility. The methodology and results presented in this work are aimed at expanding the understanding of these data-analytic biases that affect development of clinically robust molecular signatures. Several recommendations follow from the current study. First, all molecular signatures of a phenotype should be extracted to the extent possible, in order to provide comprehensive and accurate grounds for understanding disease pathogenesis. Second, redundant genes should generally be removed from final signatures to facilitate reproducibility and decrease manufacturing costs. Third, data preprocessing procedures should be designed so as not to bias biomarker selection. Finally, molecular signatures developed and applied on different phenotypes and populations of patients should be treated with great caution.

    Causal graph-based analysis of genome-wide association data in rheumatoid arthritis

    Background: GWAS owe their popularity to the expectation that they will make a major impact on diagnosis, prognosis and management of disease by uncovering genetics underlying clinical phenotypes. The dominant paradigm in GWAS data analysis so far consists of extensive reliance on methods that emphasize contribution of individual SNPs to statistical association with phenotypes. Multivariate methods, however, can extract more information by considering associations of multiple SNPs simultaneously. Recent advances in other genomics domains pinpoint multivariate causal graph-based inference as a promising principled analysis framework for high-throughput data. Designed to discover biomarkers in the local causal pathway of the phenotype, these methods lead to accurate and highly parsimonious multivariate predictive models. In this paper, we investigate the applicability of the causal graph-based method TIE* to analysis of GWAS data. To test the utility of TIE*, we focus on anti-CCP positive rheumatoid arthritis (RA) GWAS datasets, where there is a general consensus in the community about the major genetic determinants of the disease.
    Results: Application of TIE* to the North American Rheumatoid Arthritis Cohort (NARAC) GWAS data results in six SNPs, mostly from the MHC locus. Using these SNPs we develop two predictive models that can classify cases and disease-free controls with an accuracy of 0.81 area under the ROC curve, as verified in independent testing data from the same cohort. The predictive performance of these models generalizes reasonably well to Swedish subjects from the closely related but not identical Epidemiological Investigation of Rheumatoid Arthritis (EIRA) cohort, with 0.71-0.78 area under the ROC curve. Moreover, the SNPs identified by the TIE* method render many other previously known SNP associations conditionally independent of the phenotype.
    Conclusions: Our experiments demonstrate that application of TIE* captures the maximum amount of genetic information about RA in the data and recapitulates the major consensus findings about the genetic factors of this disease. In addition, TIE* yields reproducible markers and signatures of RA. This suggests that a principled multivariate causal and predictive framework for GWAS analysis empowers the community with a new tool for high-quality and more efficient discovery.
    Reviewers: This article was reviewed by Prof. Anthony Almudevar, Dr. Eugene V. Koonin, and Prof. Marianthi Markatou.

    Computational investigation of cancer genomes

    Cancer is a leading cause of death worldwide, and its incidence is increasing due to modern lifestyles that have prolonged human life. All cancers originate from a single cell that has acquired genetic aberrations enabling uncontrolled proliferation. Each cancer is unique in its aberrant genetic makeup, which defines, to a large extent, its biology, aggressiveness, and vulnerabilities to different treatments. Furthermore, the genetic makeup of each cancer is heterogeneous among its constituent cancer cells, and dynamic, with the ability to evolve in order to preserve the survival of cancer cells. Sequencing technologies are currently producing massive amounts of data that, with the help of specialized computational methods, can revolutionize our knowledge of cancer. A key question in cancer research is how to personalize the treatment of cancer patients, so that each cancer is treated according to its molecular characteristics. The first study in this thesis takes a step in that direction through a proposed novel molecular classification system of diffuse large B-cell lymphoma (DLBCL), which is the most common hematological malignancy in adults. The suggested classification, derived from the integrative analysis of gene expression and DNA mutations, stratifies DLBCL into four groups with distinct biology, genetic landscapes, and clinical outcome. These subtypes could help identify patients at high risk who may benefit from an altered treatment plan. Understanding the genomic evolution of cancer that transforms a typically curable primary tumor into an incurable drug-resistant metastasis is another aspect of cancer research under intensive investigation. The second study in this thesis investigates the spreading patterns of metastasis in breast cancer, which is the most common cancer in women. Using phylogenetic analysis of somatic mutations from longitudinal breast cancer samples, the metastasis routes were uncovered. The study revealed that breast cancer spreads either in parallel from the primary tumor to multiple distant sites, or linearly from the primary tumor to a distant site, and then from that site to another. However, in all cases, axillary lymph nodes did not mediate the spreading to distant sites. This provided genetic evidence for the redundancy of lymph node dissection in breast cancer management. Towards genetic-based diagnostics in cancer, the computational methods used to detect genetic aberrations need to be evaluated for their accuracy. The third study in this thesis performs a comparison of methods for detecting somatic copy number alterations from cancer samples. The study evaluated several commonly used methods for two different sequencing platforms using simulated and real cancer data. The results provided an overview of the weaknesses of the different methods that could be methodologically improved. Altogether, this thesis gives an overview of the field of computational cancer genomics and presents three studies that exemplify the clinical relevance of computational research.

    Predictive integration of gene functional similarity and co-expression defines treatment response of endothelial progenitor cells

    Background: Endothelial progenitor cells (EPCs) have been implicated in different processes crucial to vasculature repair, which may offer the basis for new therapeutic strategies in cardiovascular disease. Despite advances facilitated by functional genomics, there is a lack of systems-level understanding of treatment response mechanisms of EPCs. In this research we aimed to characterize the EPC response to adenosine (Ado), a cardioprotective factor, based on the systems-level integration of gene expression data and prior functional knowledge. Specifically, we set out to identify novel biosignatures of Ado-treatment response in EPCs.
    Results: The predictive integration of gene expression data and standardized functional similarity information enabled us to identify new treatment response biosignatures. Gene expression data originated from Ado-treated and -untreated EPC samples, and functional similarity was estimated with Gene Ontology (GO)-based similarity information. These information sources enabled us to implement and evaluate an integrated prediction approach based on the concept of k-nearest neighbours learning (kNN). The method can be executed by expert- and data-driven input queries to guide the search for biologically meaningful biosignatures. The resulting integrated kNN system identified new candidate EPC biosignatures that can offer high classification performance (areas under the operating characteristic curve > 0.8). We also showed that the proposed models can outperform those discovered by standard gene expression analysis. Furthermore, we report an initial independent in vitro experimental follow-up, which provides additional evidence of the potential validity of the top biosignature.
    Conclusion: Response to Ado treatment in EPCs can be accurately characterized with a new method based on the combination of gene co-expression data and GO-based similarity information. It also exploits the incorporation of human expert-driven queries as a strategy to guide the automated search for candidate biosignatures. The proposed biosignature improves the systems-level characterization of EPCs. The new integrative predictive modeling approach can also be applied to other phenotype characterization or biomarker discovery problems.

    Multiplicity: an organizing principle for cancers and somatic mutations

    Background: With the advent of whole-genome analysis for profiling tumor tissue, a pressing need has emerged for principled methods of organizing the large amounts of resulting genomic information. We propose the concept of multiplicity measures on cancer and gene networks to organize the information in a clinically meaningful manner. Multiplicity applied in this context extends Fearon and Vogelstein's multi-hit genetic model of colorectal carcinoma across multiple cancers.
    Methods: Using the Catalogue of Somatic Mutations in Cancer (COSMIC), we construct networks of interacting cancers and genes. Multiplicity is calculated by evaluating the number of cancers and genes linked by the measurement of a somatic mutation. The Kamada-Kawai algorithm is used to find a two-dimensional minimum energy solution with multiplicity as an input similarity measure. Cancers and genes are positioned in two dimensions according to this similarity. A third dimension is added to the network by assigning a maximal multiplicity to each cancer or gene. Hierarchical clustering within this three-dimensional network is used to identify similar clusters in somatic mutation patterns across cancer types.
    Results: The clustering of genes in a three-dimensional network reveals a similarity in acquired mutations across different cancer types. Surprisingly, the clusters separate known causal mutations. The multiplicity clustering technique identifies a set of causal genes with an area under the ROC curve of 0.84 versus 0.57 when clustering on gene mutation rate alone. The cluster multiplicity value and number of causal genes are positively correlated via Spearman's Rank Order correlation (r_s(8) = 0.894, Spearman's t = 17.48, p < 0.05). A clustering analysis of cancer types segregates different types of cancer. All blood tumors cluster together, and the cluster multiplicity values differ significantly (Kruskal-Wallis, H = 16.98, df = 2, p < 0.05).
    Conclusion: We demonstrate the principle of multiplicity for organizing somatic mutations and cancers in clinically relevant clusters. These clusters of cancers and mutations provide representations that identify segregations of cancer and genes driving cancer progression.
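
As a rough illustration of the counting step in the Methods of the abstract above, the following minimal sketch assumes one possible reading: a gene's multiplicity is the number of distinct cancer types in which a somatic mutation of that gene was recorded, and a cancer's multiplicity is the number of distinct genes recorded as mutated in it. The function name `multiplicity`, the `(cancer, gene)` record format, and the placeholder data are all hypothetical; they are not taken from the paper or from COSMIC, and the paper's exact definition may differ.

```python
from collections import defaultdict


def multiplicity(records):
    """Count links between cancers and genes (hypothetical definition).

    records: iterable of (cancer, gene) pairs, each meaning that a somatic
    mutation in `gene` was observed in cancer type `cancer`.
    Returns two dicts: gene -> number of distinct cancers,
    and cancer -> number of distinct genes.
    """
    cancers_per_gene = defaultdict(set)
    genes_per_cancer = defaultdict(set)
    for cancer, gene in records:
        # Sets ignore repeated observations of the same cancer-gene link.
        cancers_per_gene[gene].add(cancer)
        genes_per_cancer[cancer].add(gene)
    gene_mult = {g: len(cs) for g, cs in cancers_per_gene.items()}
    cancer_mult = {c: len(gs) for c, gs in genes_per_cancer.items()}
    return gene_mult, cancer_mult


# Placeholder records, invented purely for illustration.
toy_records = [
    ("cancer_A", "gene_X"),
    ("cancer_B", "gene_X"),
    ("cancer_A", "gene_Y"),
    ("cancer_C", "gene_X"),
    ("cancer_A", "gene_X"),  # duplicate link, counted once
]
gene_mult, cancer_mult = multiplicity(toy_records)
print(gene_mult)
print(cancer_mult)
```

Under this reading, counts of this kind would then serve as the similarity input to the layout and clustering steps that the abstract describes; those steps are not sketched here.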

    Advantages of genomic complexity: bioinformatics opportunities in microRNA cancer signatures

    MicroRNAs, small non-coding RNAs, may act as tumor suppressors or oncogenes, and each regulate their own transcription and that of hundreds of genes, often in a tissue-dependent manner. This creates a tightly interwoven network regulating and underlying oncogenesis and cancer biology. Although protein-coding gene signatures and single protein pathway markers have proliferated over the past decade, routine adoption of the former has been hampered by interpretability, reproducibility, and dimensionality, whereas the single molecule–phenotype reductionism of the latter is often overly simplistic to account for complex phenotypes. MicroRNA-derived biomarkers offer a powerful alternative; they have both the flexibility of gene expression signature classifiers and the desirable mechanistic transparency of single protein biomarkers. Furthermore, several advances have recently demonstrated the robust detection of microRNAs from various biofluids, thus providing an additional opportunity for obtaining bioinformatically derived biomarkers to accelerate the identification of individual patients for personalized therapy.

    A biological function based biomarker panel optimization process.

    Multi-gene biomarker panels identified from high-throughput data, including microarray or next-generation sequencing, need to be adapted to a platform suitable for a clinical setting, such as quantitative polymerase chain reaction. However, technical challenges when transitioning from one measurement platform to another, such as inconsistent measurement results, can affect panel development. We describe a process to overcome these challenges by replacing poorly performing genes during platform transition and reducing the number of features without impacting classification performance. This approach assumes that a diagnostic panel reflects the effect of dysregulated biological processes associated with a disease, and that genes involved in the same biological processes and coordinately affected by a disease share a similar discriminatory power. The utility of this optimization process was assessed using a published sepsis diagnostic panel. Substitution of more than half of the genes and/or reducing genes based on biological processes did not negatively affect the performance of the sepsis diagnostic panel. Our results suggest that a systematic gene substitution and reduction process based on biological function can be used to alleviate the challenges associated with clinical development of biomarker panels.

    Tissue-specific transcriptome profiling of the citrus fruit epidermis and subepidermis using laser capture microdissection

    Most studies of the biochemical and regulatory pathways that are associated with, and control, fruit expansion and ripening are based on homogenized bulk tissues, and do not take into consideration the multiplicity of different cell types from which the analytes, be they transcripts, proteins or metabolites, are extracted. Consequently, potentially valuable spatial information is lost, and the lower-abundance cellular components that are expressed only in certain cell types can be diluted below the level of detection. In this study, laser microdissection (LMD) was used to isolate epidermal and subepidermal cells from green, expanding Citrus clementina fruit, and their transcriptomes were compared using a 20k citrus cDNA microarray and quantitative real-time PCR. The results show striking differences in gene expression profiles between the two cell types, revealing specific metabolic pathways that can be related to their respective organelle composition and cell wall specialization. Microscopy provided additional evidence of tissue specialization that could be associated with the transcript profiles, with distinct differences in organelle and metabolite accumulation. Subepidermis-predominant genes are primarily involved in photosynthesis- and energy-related processes, as well as cell wall biosynthesis and restructuring. By contrast, most epidermis-predominant genes are related to the biosynthesis of the cuticle, flavonoids, and defence responses. Furthermore, the epidermis transcript profile showed a high proportion of genes with no known function, supporting the original hypothesis that analysis at the tissue/cell-specific levels can promote gene discovery and lead to a better understanding of the specialized contribution of each tissue to fruit physiology.