19 research outputs found

    Semantic knowledge graphs to understand tumor evolution and predict disease survival in cancer

    Genomics technologies have generated large amounts of easily accessible biological -omics data, providing an unprecedented opportunity to study disease mechanisms in cancer. However, clinical research and the life sciences critically require a unified, integrated data model to facilitate the prognostic and diagnostic validation of the biomarkers obtained. Based on our research, knowledge graphs emerged as a promising solution for genomics and other -omics datasets. The primary reason for selecting a knowledge-graph-based approach is that much of the data come from single cohorts, such as TCGA and ICGC, that are carefully constructed to mitigate bias, whereas emerging datasets supporting the understanding of the complete mechanism are unstructured and held in silos. Larger datasets such as TCGA and ICGC are patient cohorts with issues ranging from patient self-selection to confounding by indication to limited knowledge of outcome data, and can therefore introduce inadvertent bias if used alone for biomarker discovery. However, including molecular data such as copy number variation (CNV), DNA methylation, gene expression and mutation data (COSMIC, DoCM, MethylDB) adds mechanistic features alongside the observational data from these cohorts. We applied semantic web and linked data approaches for knowledge graph embedding, and federated networks for rapidly changing information, in characterizing disease, specifically cancer. This rapid change plays a critical role in disease progression and disease mechanism. Usually, these mechanisms are explained through biomarkers retrieved by comparative analysis of cancer stages against controls using quantitative gene expression data. Our knowledge graph, however, facilitated including not only the quantitative data but also supporting molecular mechanisms, helping to understand the change in pattern and its associated factors.
Cancer genomic events are layered processes, and prediction models have to accommodate multi-omics data so that each molecular subtype feeds into incremental knowledge. This layered knowledge helps to improve the prediction of clinical outcomes, to elucidate the interplay between different levels, and to support disease modeling through layered data assembly. In our approach, we extended knowledge graphs beyond conventional knowledge enrichment and introduced a pattern mining approach to track disease indicators, such as those in cancer, in a continuous way. We introduced a topological motif perturbation approach across disease stages to uncover, through continuous knowledge enrichment, the instances responsible for the change in pattern and thus for disease progression. Further, we applied a graph convolutional neural network (GCNN) approach to identify the features needed not only to track the disease mechanism but also to predict survival and relapse in patients. We customized the neural network so that, during learning, we could adjust the weight of each dataset or new concept added to the knowledge graph. The customized GCNN not only helped to predict relapse accurately but also helped to determine the most relevant level of each dataset in cancer genomics. The approach was tested and validated across various cancer types and contributed not only a novel way to integrate, understand and predict cancer but also novel biomarkers, e.g. the contribution of biomarkers in gynecological cancers such as breast, ovarian, cervical and uterine cancer. The biomarkers retrieved through this approach contributed novel information about genes such as MYH7 involved in these cancers. We applied a motif-based pattern mining approach and established the relevant biomarkers to explain cancer progression mechanisms.
Lastly, we developed prediction models for breast and pancreatic cancer and derived clinical indicators. We also contributed COSMIC, TCGA and other RDF datasets to the linked open data (LOD) cloud, with new enriched links from our knowledge graph: "Oncology LOD".
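The dataset-weighted GCNN layer described in the abstract can be sketched as follows; this is a minimal illustration, not the authors' implementation, and assumes standard details (symmetric adjacency normalization, ReLU activation) plus one learnable scalar weight per dataset's block of feature columns:

```python
import numpy as np

def normalize_adjacency(A):
    # Symmetric normalization D^-1/2 (A + I) D^-1/2, standard in GCNs
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def gcn_layer(A_norm, H, W, dataset_weights, dataset_slices):
    # Scale each dataset's block of feature columns by its learnable weight,
    # so e.g. expression vs. methylation features can be up- or down-weighted
    H_scaled = H.copy()
    for w, (start, stop) in zip(dataset_weights, dataset_slices):
        H_scaled[:, start:stop] *= w
    # Propagate over the (knowledge) graph and apply ReLU
    return np.maximum(A_norm @ H_scaled @ W, 0.0)
```

For example, a node feature matrix whose first three columns come from gene expression and last three from methylation could be passed with `dataset_slices=[(0, 3), (3, 6)]` and two scalar weights; the weights would be trained jointly with `W`.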

    Features’ compendium for machine learning in NGS data Analysis

    Background: Current studies of the cancer genome largely involve next generation sequencing (NGS) technologies followed by data analysis pipelines. Many of these pipelines comprise tools using machine learning algorithms, especially for downstream analysis. Features are important components of machine learning systems, and the inclusion of informative features improves the accuracy of machine learning algorithms. The algorithms used in NGS analysis lead to the generation of a huge feature space. Sometimes this high dimensionality leads to slower analysis and lower accuracy, due to inherent bias of the model and/or redundancy of a few features. With the growth of interest in NGS studies, there has been rapid development of new NGS analysis tools and improvement in the performance of previous ones, by including new features and excluding redundant ones. To enable this development, there is a dire need to standardize the plethora of features available in the literature.
Results: The current work presents a compendium of features that have been used in the literature for machine learning in NGS data pipelines and analysis. The features have been further classified, treating each stage of NGS data processing as an individual category: (a) pre-processing features, (b) sequencing-technology-specific features, and (c) downstream features, or features for biological interpretation and analysis. This categorization will facilitate the use of correct features in a simplified manner.
Conclusions: The work will facilitate a uniform model for developing NGS tools that utilize machine learning approaches for the study of cancer data. A model for a feature database and management based on this standardization is also proposed.
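The three-way categorization could be realized as a small feature registry; the category names follow the abstract, while the individual feature names below are hypothetical examples, not entries from the actual compendium:

```python
# Hypothetical feature registry keyed by NGS pipeline stage
FEATURE_CATEGORIES = {
    "pre_processing": ["base_quality", "gc_content", "read_length", "duplication_rate"],
    "sequencing_specific": ["coverage_depth", "mapping_quality", "strand_bias"],
    "downstream": ["variant_allele_frequency", "gene_expression", "pathway_membership"],
}

def categorize(feature):
    # Return the pipeline stage a feature belongs to
    for stage, features in FEATURE_CATEGORIES.items():
        if feature in features:
            return stage
    raise KeyError(f"Unknown feature: {feature}")
```

A feature database built on this scheme would let tool developers query, for instance, all sequencing-technology-specific features before choosing model inputs.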

    One size does not fit all: querying web polystores

    Data retrieval systems are facing a paradigm shift due to the proliferation of specialized data storage engines (SQL, NoSQL, column stores, MapReduce, data stream, and graph) supported by varied data models (CSV, JSON, RDB, RDF, and XML). One immediate consequence of this paradigm shift is a data bottleneck over the web: web applications are unable to retrieve data at the intensity with which data are being generated from different facilities. Especially in the genomics and healthcare verticals, data are growing from petascale to exascale, and biomedical stakeholders expect seamless retrieval of these data over the web. In this paper, we argue that the bottleneck over the web can be reduced by minimizing the costly data conversion process and delegating query performance and processing loads to the specialized data storage engines over their native data models. We propose a web-based query federation mechanism, called PolyWeb, that unifies query answering over multiple native data models (CSV, RDB, and RDF). We emphasize two main challenges of query federation over native data models: 1) devising a method to select prospective data sources, with different underlying data models, that can satisfy a given query, and 2) query optimization, join, and execution over different data models. We demonstrate PolyWeb on a cancer genomics use case, where a description of biological and chemical entities (e.g., genes, diseases, drugs, and pathways) often spans multiple data models and respective storage engines. To assess the benefits and limitations of evaluating queries over native data models, we evaluate PolyWeb against state-of-the-art query federation engines in terms of result completeness, source selection, and overall query execution time.
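The first challenge, source selection, can be illustrated with a toy capability catalog; the source names, data-model labels, and `select_sources` helper below are illustrative assumptions, not PolyWeb's actual API:

```python
# Hypothetical catalog: each source, its native data model, and the entity
# types it can answer
SOURCES = {
    "cosmic_csv":   {"model": "CSV", "entities": {"gene", "mutation"}},
    "drugbank_rdb": {"model": "RDB", "entities": {"drug", "target"}},
    "kegg_rdf":     {"model": "RDF", "entities": {"pathway", "gene"}},
}

def select_sources(query_entities):
    # Keep only the sources that can contribute at least one entity type
    # mentioned in the query, regardless of their underlying data model
    return {name: meta for name, meta in SOURCES.items()
            if meta["entities"] & query_entities}
```

A federated query touching genes and drugs would thus be routed to a CSV, an RDB, and an RDF engine simultaneously, with joins performed over the partial results rather than over a converted common representation.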

    Querying web polystores

    The database, semantic web, and linked data communities have proposed solutions that federate queries over multiple data sources using a single data model. Nowadays, the data retrieval requirements originating from versatile and broad domains like healthcare and life sciences (HCLS) are changing this conventional trend of federating queries over a single data model, primarily due to the simultaneous use of different data models (CSV, JSON, RDB, RDF, XML, etc.) in real-life scenarios. It is now impractical to assume that the variety (graph, key-value, stream, text, table, tree, etc.) of high-volume data residing in specialized storage engines will first be converted to a common data model, stored in a general-purpose data storage engine, and finally be queried over the web. In this era where genomics datasets are growing from petascale to exascale, it is important to exploit such vast domain resources in their native data models. The key approach is to query the vast data resources from their native data models and specialized storage engines. In this paper, we propose a web-based query federation mechanism, called PolyWeb, that unifies query answering over multiple native data models (CSV, RDB, and RDF). We demonstrate PolyWeb on a cancer genomics use case where a description of biological and chemical entities (e.g., genes, diseases, drugs, pathways) often spans multiple data models. To assess the benefits and limitations of evaluating queries over native data models, we evaluate PolyWeb against state-of-the-art query federation engines in terms of result completeness, source selection, and overall query execution time.

    Deep convolution neural network model to predict relapse in breast cancer

    A mishap in anti-cancer drug distribution is critical in breast cancer patients due to poor prediction models for identifying the treatment regime in ER+ve and ER-ve (estrogen receptor (ER)) patients. The traditional method of prediction depends on the change in expression across normal-disease pairs. However, it misses the multidimensional aspects and underlying causes of relapse, such as various mutations, drug dosage side effects, methylation, etc. In this paper, we have developed a multi-layer neural network model to classify multidimensional genomics data into similar annotation groups. Further, we used this multi-layer cancer genomics perceptron to annotate differentially expressed genes (DEGs) and predict relapse based on ER status in breast cancer. This approach provides multivariate identification of genes, not just by differential expression, but by the cause-effect of disease status due to drug overdosage, together with a genomics-driven drug balancing method. In the multi-layered neural network model, each layer defines the relationship of similar databases with multidimensional knowledge. We illustrate that using a multilayer knowledge graph with gene expression data to train the deep convolutional neural network stratifies patient relapse and drug dosage along with the underlying molecular properties. This publication has emanated from research conducted with the financial support of Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289, co-funded by the European Regional Development Fund.
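The forward pass of such a multi-layer relapse classifier can be sketched as follows; this is a minimal illustration under assumed details (ReLU hidden layers, a sigmoid output for relapse probability), not the paper's trained model:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_relapse(x, layers):
    # Each hidden layer corresponds to one annotation level of the genomics
    # data (e.g. expression, mutation, methylation); 'layers' is a list of
    # (weight matrix, bias vector) pairs
    h = x
    for W, b in layers[:-1]:
        h = relu(h @ W + b)
    W, b = layers[-1]
    return sigmoid(h @ W + b)  # probability of relapse per patient
```

Given a patient-by-feature matrix of multidimensional genomics data, the output is one relapse probability per patient, which can then be stratified by ER status.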

    Extending inner-ear anatomical concepts in the Foundational Model of Anatomy (FMA) ontology

    The inner ear is physically inaccessible in living humans, which leads to unique difficulties in studying its normal function and pathology, unlike other human organs. Recently, biosimulation models have gained significant attention as a way to understand the exact causative factors that give rise to impairment in human organs. However, building a biosimulation model requires human organ concepts and their topological relationships from multiple, semantically overlapping domains such as biology, anatomy, and geometrical, mathematical and physical models. In this paper, we focus on modelling the inner-ear macro-anatomical concepts and their topological relationships. We extended the Foundational Model of Anatomy (FMA) ontology to cover a micro-level version of human inner-ear anatomy, in which the connections between simulated tissues, liquids, soft tissues and adjacent parts (e.g. hair cells, perilymph) were studied in detail, included and implemented. This publication has emanated from research supported in part by a research grant from Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289 and EU project SIFEM (contract Number 600933).
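The kind of extension described could look like the following Turtle sketch. All `ext:` terms are hypothetical, and `fma:HairCell` is a placeholder (real FMA classes use numeric `fmaNNNNN` identifiers); the fragment only illustrates subclassing a new micro-level concept under an existing FMA class and adding a topological relationship:

```turtle
@prefix fma:  <http://purl.org/sig/ont/fma/> .
@prefix ext:  <http://example.org/inner-ear-ext#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .

# Hypothetical micro-level class placed under an existing FMA class
ext:OuterHairCell a owl:Class ;
    rdfs:subClassOf fma:HairCell ;  # placeholder FMA identifier
    rdfs:label "Outer hair cell" .

# Topological relationship to an adjacent fluid compartment
ext:adjacentTo a owl:ObjectProperty .
ext:OuterHairCell ext:adjacentTo ext:Perilymph .
```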