27 research outputs found

    A community challenge to evaluate RNA-seq, fusion detection, and isoform quantification methods for cancer discovery

    The accurate identification and quantification of RNA isoforms present in the cancer transcriptome is key for analyses ranging from inferring the impact of somatic variants to pathway analysis, biomarker development, and subtype discovery. The ICGC-TCGA DREAM Somatic Mutation Calling in RNA (SMC-RNA) challenge was a crowd-sourced effort to benchmark methods for RNA isoform quantification and fusion detection from bulk cancer RNA sequencing (RNA-seq) data. It concluded in 2018 with a comparison of 77 fusion detection entries and 65 isoform quantification entries on 51 synthetic tumors and 32 cell lines with spiked-in fusion constructs. We report the entries used to build this benchmark, the leaderboard results, and the experimental features associated with the accurate prediction of RNA species. This challenge required submissions to be in the form of containerized workflows, meaning each of the entries described is easily reusable through CWL and Docker containers at https://github.com/SMC-RNA-challenge. A record of this paper's transparent peer review process is included in the supplemental information.
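
    To make the scoring concrete: a fusion-detection entry is typically compared against the known spiked-in truth set. The following is a minimal, hypothetical Python sketch of such a comparison; the fusion pairs and the orientation-insensitive matching rule are illustrative assumptions, not the challenge's actual evaluation harness.

```python
# Minimal sketch of scoring predicted gene fusions against a truth set.
# Fusion lists are illustrative; the SMC-RNA challenge used its own
# containerized evaluation workflows (CWL + Docker).

def normalize(fusion):
    """Treat GENE_A--GENE_B and GENE_B--GENE_A as the same event (an assumption)."""
    return tuple(sorted(fusion))

def score_fusions(predicted, truth):
    pred = {normalize(f) for f in predicted}
    true = {normalize(f) for f in truth}
    tp = len(pred & true)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(true) if true else 0.0
    return precision, recall

truth = [("TMPRSS2", "ERG"), ("BCR", "ABL1"), ("EML4", "ALK")]
predicted = [("ERG", "TMPRSS2"), ("BCR", "ABL1"), ("FGFR3", "TACC3")]
p, r = score_fusions(predicted, truth)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.67 recall=0.67
```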

    Semantic knowledge graphs to understand tumor evolution and predict disease survival in cancer

    Genomics technologies have generated large amounts of easily accessible biological -omics data, providing an unprecedented opportunity to study disease mechanisms in cancer. However, clinical research and the life sciences critically require a unified, integrated data model to facilitate the prognostic and diagnostic validation of the biomarkers obtained. In our research, knowledge graphs emerged as a promising solution for genomics and other -omics datasets. The primary reason for selecting a knowledge-graph-based approach is that much of the data comes from single cohorts, such as TCGA and ICGC, that are carefully constructed to mitigate bias, while the emerging datasets that support understanding of the complete mechanism are unstructured and sit in silos. Larger datasets such as TCGA and ICGC are patient cohorts with issues ranging from patient self-selection to confounding by indication to limited knowledge of outcome data, and can therefore introduce inadvertent bias if used alone for biomarker discovery. However, the inclusion of molecular data such as copy number variation (CNV), DNA methylation, gene expression, and mutation data (COSMIC, DoCM, MethylDB) adds mechanistic features alongside the observational data from these cohorts. We applied semantic web and linked data approaches, knowledge graph embedding, and federated networks to capture the rapidly changing information involved in characterizing disease, specifically cancer; this rapid change plays a critical role in disease progression and disease mechanism. Usually, these mechanisms are explained through biomarkers retrieved by comparative analysis of cancer stages against controls using quantitative gene expression data. Our knowledge graph, however, made it possible to include not only the quantitative data but also the supporting molecular mechanisms, to understand the change in pattern and its associated factors. Cancer genomic events are layered processes, and prediction models have to accommodate multi-omics data so that each molecular subtype feeds into incremental knowledge. This layered knowledge helps to improve the prediction of clinical outcomes, to elucidate the interplay between different levels, and to model disease through layered data assembly. In our approach, we extended knowledge graphs beyond conventional knowledge enrichment and introduced a pattern mining approach to track disease indicators, such as those in cancer, in a continuous way. We introduced a topological motif perturbation approach across disease stages that, through continuous knowledge enrichment, uncovers the instances responsible for changes in pattern and hence for disease progression. Further, we applied a graph convolutional neural network (GCNN)-based approach to identify the features required not only to track the disease mechanism but also to predict survival and relapse in patients. We customized the neural network so that, during learning, the weight of each dataset or new concept added to the knowledge graph can be adjusted. The customized GCNN helped not only to predict relapse accurately but also to discern the most relevant level of each dataset in cancer genomics. The approach was tested and validated across various cancer types and contributed not only a novel way to integrate, understand, and predict cancer but also novel biomarkers, e.g. in gynecological cancers such as breast, ovarian, cervical, and uterine cancer. The biomarkers retrieved through this approach contributed novel information about genes, such as MYH7, involved in these cancers. We applied a motif-based pattern mining approach and established the relevant biomarkers to explain cancer progression mechanisms. Lastly, we developed a prediction model for breast and pancreatic cancer and derived clinical indicators. We also contributed COSMIC, TCGA, and other RDF datasets, with new enriched links from our knowledge graph, to the linked open data (LOD) cloud: "Oncology LOD".
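
    The dataset-weighted graph convolution described above can be pictured with a small numpy sketch: one propagation step H' = ReLU(A_hat (H * w) W), where w scales the feature columns contributed by each source dataset. All shapes, weights, and the three-dataset split below are toy assumptions for illustration; the paper's actual GCNN is not reproduced here.

```python
import numpy as np

# Toy sketch of one dataset-weighted graph-convolution step over a
# knowledge graph: H_next = ReLU(A_hat @ (H * w) @ W). Illustrative only.
rng = np.random.default_rng(0)
n_nodes, n_feats, n_hidden = 5, 6, 4

A = (rng.random((n_nodes, n_nodes)) < 0.4).astype(float)
A = np.maximum(A, A.T)                   # undirected adjacency
A_hat = A + np.eye(n_nodes)              # add self-loops
d = 1.0 / np.sqrt(A_hat.sum(axis=1))
A_hat = A_hat * d[:, None] * d[None, :]  # symmetric normalization

H = rng.random((n_nodes, n_feats))       # features: 3 omics blocks of 2 dims each
# One (learnable) weight per source dataset, e.g. expression, CNV, methylation,
# broadcast over that dataset's feature columns -- the "customized weight" idea.
w = np.repeat(np.array([1.0, 0.5, 2.0]), 2)

W = rng.random((n_feats, n_hidden))
H_next = np.maximum(A_hat @ (H * w) @ W, 0.0)  # ReLU
print(H_next.shape)  # (5, 4)
```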

    Features’ compendium for machine learning in NGS data analysis

    Background: Current studies of the cancer genome largely involve next-generation sequencing (NGS) technologies followed by data-analysis pipelines. Many of these pipelines comprise tools that use machine learning algorithms, especially for downstream analysis. Features are important components of machine learning systems, and the inclusion of informative features improves the accuracy of machine learning algorithms. The algorithms used in NGS analysis generate a huge feature space. This high dimensionality can lead to slower analysis and lower accuracy, due to inherent model bias and/or redundancy among features. With the growth of interest in NGS studies, there has been rapid development of new NGS analysis tools and improvement of existing ones by including new features and excluding redundant ones. To enable this development, there is a dire need to standardize the plethora of features available in the literature. Results: This work presents a compendium of features that have been used in the literature for machine learning in NGS data pipelines and analysis. The features have been further classified, treating each stage of NGS data processing as an individual category: (a) pre-processing features, (b) sequencing-technology-specific features, and (c) downstream features, i.e., features for biological interpretation and analysis. This categorization will facilitate the use of the correct features in a simplified manner. Conclusions: The work will facilitate a uniform model for the development of NGS tools that use machine learning approaches to study cancer data. A model for a feature database and its management based on this standardization is also proposed.
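
    As a sketch of how such a compendium could be organized by pipeline stage, the hypothetical Python schema below follows the abstract's three categories; the example features and field names are illustrative assumptions, not the proposed database design.

```python
from dataclasses import dataclass

# Hypothetical schema for a feature compendium keyed by NGS pipeline stage.
# Category names follow the abstract; the entries are invented examples.

@dataclass
class Feature:
    name: str
    category: str  # "pre-processing" | "sequencing-specific" | "downstream"
    description: str

compendium = [
    Feature("per_base_quality", "pre-processing",
            "Phred quality distribution used to trim or filter reads"),
    Feature("read_length", "sequencing-specific",
            "Platform-dependent read length affecting alignment models"),
    Feature("variant_allele_frequency", "downstream",
            "Fraction of reads supporting a variant, used in interpretation"),
]

by_category = {}
for f in compendium:
    by_category.setdefault(f.category, []).append(f.name)
print(by_category)
```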

    One size does not fit all: querying web polystores

    Data retrieval systems are facing a paradigm shift due to the proliferation of specialized data storage engines (SQL, NoSQL, column stores, MapReduce, data stream, and graph) supported by varied data models (CSV, JSON, RDB, RDF, and XML). One immediate consequence of this paradigm shift is a data bottleneck over the web: web applications are unable to retrieve data at the intensity at which they are generated by different facilities. Especially in the genomics and healthcare verticals, data are growing from petascale to exascale, and biomedical stakeholders expect seamless retrieval of these data over the web. In this paper, we argue that the bottleneck over the web can be reduced by minimizing the costly data conversion process and delegating query performance and processing loads to the specialized data storage engines over their native data models. We propose a web-based query federation mechanism—called PolyWeb—that unifies query answering over multiple native data models (CSV, RDB, and RDF). We emphasize two main challenges of query federation over native data models: 1) devising a method to select prospective data sources—with different underlying data models—that can satisfy a given query, and 2) query optimization, join, and execution over different data models. We demonstrate PolyWeb on a cancer genomics use case, where it is often the case that a description of biological and chemical entities (e.g., genes, diseases, drugs, and pathways) spans multiple data models and their respective storage engines. In order to assess the benefits and limitations of evaluating queries over native data models, we compare PolyWeb with state-of-the-art query federation engines in terms of result completeness, source selection, and overall query execution time.
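
    The two challenges named above, source selection and joining across data models, can be sketched generically. The toy Python below is not PolyWeb's engine: each source advertises the fields it can answer, relevant sources are selected per query, and their partial results are hash-joined on a shared key. Source names and rows are invented for illustration.

```python
# Toy sketch of federated query answering over heterogeneous sources:
# select sources by the fields they expose, then hash-join partial results
# on a shared key. Invented data; not PolyWeb itself.

sources = {
    "genes_csv":    {"fields": {"gene", "pathway"},
                     "rows": [{"gene": "BRCA1", "pathway": "DNA repair"}]},
    "drugs_rdb":    {"fields": {"gene", "drug"},
                     "rows": [{"gene": "BRCA1", "drug": "olaparib"}]},
    "diseases_rdf": {"fields": {"gene", "disease"},
                     "rows": [{"gene": "TP53", "disease": "LFS"}]},
}

def select_sources(wanted):
    """Source selection: keep only sources contributing a wanted field."""
    return [s for s in sources.values() if s["fields"] & wanted]

def hash_join(left, right, key):
    index = {}
    for row in left:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for r in right for l in index.get(r[key], [])]

selected = select_sources({"pathway", "drug"})
result = selected[0]["rows"]
for s in selected[1:]:
    result = hash_join(result, s["rows"], "gene")
print(result)  # [{'gene': 'BRCA1', 'pathway': 'DNA repair', 'drug': 'olaparib'}]
```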

    Deep convolution neural network model to predict relapse in breast cancer

    Mishaps in anti-cancer drug administration are critical for breast cancer patients because of poor prediction models for identifying the treatment regime in estrogen receptor positive (ER+ve) and negative (ER-ve) patients. The traditional prediction method depends on the change in expression across normal-disease pairs. However, it misses the multidimensional aspects and underlying causes of relapse, such as various mutations, drug dosage side effects, and methylation. In this paper, we have developed a multi-layer neural network model to classify multidimensional genomics data into similar annotation groups. Further, we used this multi-layer cancer genomics perceptron to annotate differentially expressed genes (DEGs) and predict relapse based on ER status in breast cancer. This approach provides multivariate identification of genes, based not only on differential expression but also on the cause-effect of disease status due to drug overdosage, together with a genomics-driven drug-balancing method. In the multi-layered neural network model, each layer defines the relationship of similar databases with multidimensional knowledge. We illustrate that using a multilayer knowledge graph with gene expression data to train the deep convolutional neural network stratifies patient relapse and drug dosage along with the underlying molecular properties. This publication has emanated from research conducted with the financial support of Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289, co-funded by the European Regional Development Fund.
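
    As a minimal stand-in for the multi-layer model described above, the sketch below trains a small multi-layer perceptron on synthetic expression-like features, with ER status as one input column, to predict a binary relapse label. The data, signal, and hyperparameters are invented for illustration; the paper's actual architecture is not reproduced.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in: expression-like features plus a binary ER-status
# column, with a fabricated relapse signal. Illustrative only.
rng = np.random.default_rng(42)
n_samples, n_genes = 400, 50

X_expr = rng.normal(size=(n_samples, n_genes))        # toy expression matrix
er_status = rng.integers(0, 2, size=(n_samples, 1))   # 1 = ER+ve, 0 = ER-ve
X = np.hstack([X_expr, er_status])

# Invented ground truth: relapse depends on two "genes" and ER status.
logits = X_expr[:, 0] - X_expr[:, 1] + 0.8 * er_status[:, 0]
y = (logits + rng.normal(scale=0.5, size=n_samples) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0)
clf.fit(X_tr, y_tr)
print(f"held-out accuracy: {clf.score(X_te, y_te):.2f}")
```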

    Efficient distributed path computation on RDF knowledge graphs using partial evaluation

    A key property of Linked Data is the representation and publication of data as interconnected labelled graphs, where different resources linked to each other form a network of meaningful information. Searching for these important relationships between resources – within single or distributed graphs – can be reduced to a pathfinding or navigation problem, i.e., looking for chains of intermediate nodes. SPARQL 1.1, the current standard query language for RDF-based Linked Data, defines a construct – called Property Paths (PPs) – to navigate between entities within a single graph. Since Linked Data technologies are naturally aimed at decentralised scenarios, there are many cases where centralising this data for querying purposes is not feasible or even not possible. To address these problems, we propose a SPARQL PP-based graph processing approach – dubbed DpcLD – with which users can execute SPARQL PP queries and find paths distributed across multiple, connected graphs exposed as SPARQL endpoints. To execute the distributed path queries, we implemented an index-free, cache-based query engine that communicates with a shared algorithm running on each remote endpoint and computes the distributed paths. In this paper, we highlight the way in which this approach exploits and aggregates partial paths, within a distributed environment, to produce complete results. We perform extensive experiments to demonstrate the performance of our approach on two datasets: one representing 10 million triples from the DBpedia SPARQL benchmark, and the full benchmark dataset of 124 million triples. We also perform a scalability test of our approach using real-world genomics datasets distributed across multiple endpoints. We compare our distributed approach with other distributed and centralized pathfinding approaches, showing that it outperforms other distributed approaches by orders of magnitude and provides a good trade-off for cases when the data cannot be centralised.
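
    The core idea, evaluating partial paths inside each endpoint and stitching them together at a coordinator, can be illustrated with the generic toy sketch below (assumed acyclic toy graphs and an invented partition layout; not the DpcLD engine). Each "endpoint" returns paths that either reach the target or stop at a node owned by another partition, and the coordinator recursively extends the handed-off prefixes.

```python
# Toy sketch of partial-evaluation path search over partitioned graphs.
# Each partition stands in for a remote SPARQL endpoint; the coordinator
# stitches partial paths that cross partition boundaries. Invented data;
# acyclic toy graphs only (no visited-set cycle guard). Not DpcLD itself.

partitions = {
    "ep1": {"a": ["b"], "b": ["c", "x"]},
    "ep2": {"x": ["y"], "y": ["d"]},
}
owner = {node: ep for ep, g in partitions.items() for node in g}

def local_paths(ep, start, target):
    """Partial evaluation inside one endpoint: follow edges until the
    target is reached or the frontier leaves this partition."""
    graph, out, stack = partitions[ep], [], [[start]]
    while stack:
        path = stack.pop()
        node = path[-1]
        if node == target:
            out.append((path, None))             # complete path
        elif owner.get(node) != ep:
            out.append((path, owner.get(node)))  # boundary node or dead end
        else:
            stack.extend(path + [n] for n in graph.get(node, []))
    return out

def federated_path(start, target):
    results, pending = [], [(owner[start], [start])]
    while pending:
        ep, prefix = pending.pop()
        for path, next_ep in local_paths(ep, prefix[-1], target):
            full = prefix[:-1] + path
            if next_ep is None and full[-1] == target:
                results.append(full)             # complete result
            elif next_ep is not None:
                pending.append((next_ep, full))  # hand partial path onward
    return results

print(federated_path("a", "d"))  # [['a', 'b', 'x', 'y', 'd']]
```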