
    Knowledge-based Biomedical Data Science 2019

    Knowledge-based biomedical data science (KBDS) involves the design and implementation of computer systems that act as if they knew about biomedicine. Such systems depend on formally represented knowledge in computer systems, often in the form of knowledge graphs. Here we survey the progress in the last year in systems that use formally represented knowledge to address data science problems in both clinical and biological domains, as well as on approaches for creating knowledge graphs. Major themes include the relationships between knowledge graphs and machine learning, the use of natural language processing, and the expansion of knowledge-based approaches to novel domains, such as Chinese Traditional Medicine and biodiversity. Comment: Manuscript 43 pages with 3 tables; Supplemental material 43 pages with 3 table

    Discovering Complex Relationships between Drugs and Diseases

    Finding the complex semantic relations between existing drugs and new diseases can open new routes in drug development. Most drugs that have found new uses were discovered through serendipity. The prediction of drug uses for more than one disease should therefore be done systematically, by studying the semantic relations between drugs and diseases and the other entities involved in those relations. To study these complex semantic relations, an application was developed that integrates heterogeneous data in different formats from publicly available online databases. A high-level ontology was also developed to integrate the data, and only the fields required for the current study were used. The data were collected from sources such as DrugBank, UniProt/SwissProt, GeneCards, and OMIM. Most of these are standard data sources that have been used by the National Center for Biotechnology Information of the National Institutes of Health. The data were parsed and integrated, and complex semantic relations were discovered. This simple and novel effort may find uses in the development of new drug targets and in polypharmacology

    Drug prioritization using the semantic properties of a knowledge graph

    Compounds that are candidates for drug repurposing can be ranked by leveraging knowledge available in the biomedical literature and databases. This knowledge, spread across a variety of sources, can be integrated within a knowledge graph, which thereby comprehensively describes known relationships between biomedical concepts, such as drugs, diseases, genes, etc. Our work uses the semantic information between drug and disease concepts as features, which are extracted from an existing knowledge graph that integrates 200 different biological knowledge sources. RepoDB, a standard drug repurposing database which describes drug-disease combinations that were approved or that failed in clinical trials, is used to train a random forest classifier. The 10-times repeated 10-fold cross-validation performance of the classifier achieves a mean area under the receiver operating characteristic curve (AUC) of 92.2%. We apply the classifier to prioritize 21 preclinical drug repurposing candidates that have been suggested for Autosomal Dominant Polycystic Kidney Disease (ADPKD). Mozavaptan, a vasopressin V2 receptor antagonist, is predicted to be the drug most likely to be approved after a clinical trial, and belongs to the same drug class as tolvaptan, the only treatment for ADPKD that is currently approved. We conclude that semantic properties of concepts in a knowledge graph can be exploited to prioritize drug repurposing candidates for testing in clinical trials
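
    The evaluation protocol named in this abstract, repeated k-fold cross-validation scored by mean AUC, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the splits are unstratified, `fit_mean_difference` is a hypothetical one-feature stand-in for the random forest, and every function name here is invented for the example.

```python
import random


def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) formulation."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def repeated_kfold(n_samples, k=10, repeats=10, seed=0):
    """Yield (train_idx, test_idx) pairs for `repeats` independent shuffles of k folds."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]
        for i, test in enumerate(folds):
            train = [j for f in folds[:i] + folds[i + 1:] for j in f]
            yield train, test


def fit_mean_difference(x_train, y_train):
    """Hypothetical stand-in model: score a single feature relative to the class midpoint."""
    pos = [x for x, y in zip(x_train, y_train) if y == 1]
    neg = [x for x, y in zip(x_train, y_train) if y == 0]
    mid = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: x - mid


def mean_cv_auc(xs, ys, fit, k=10, repeats=10, seed=0):
    """Average test-fold AUC; `fit(x_train, y_train)` must return a scoring function."""
    aucs = []
    for train, test in repeated_kfold(len(ys), k, repeats, seed):
        score = fit([xs[i] for i in train], [ys[i] for i in train])
        y_test = [ys[i] for i in test]
        if len(set(y_test)) < 2:  # AUC is undefined on a single-class fold; skip it
            continue
        aucs.append(auc(y_test, [score(xs[i]) for i in test]))
    return sum(aucs) / len(aucs)
```

    Calling `mean_cv_auc(xs, ys, fit_mean_difference)` with the defaults runs 10 shuffles of 10 folds and averages the per-fold AUC. A real evaluation would swap in an actual classifier and would likely stratify the folds by class.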

    Using the Literature to Identify Confounders

    Prior work in causal modeling has focused primarily on learning graph structures and parameters to model data generating processes from observational or experimental data, while the focus of the literature-based discovery paradigm was to identify novel therapeutic hypotheses in publicly available knowledge. The critical contribution of this dissertation is to refashion the literature-based discovery paradigm as a means to populate causal models with relevant covariates to abet causal inference. In particular, this dissertation describes a generalizable framework for mapping from causal propositions in the literature to subgraphs populated by instantiated variables that reflect observational data. The observational data are those derived from electronic health records. The purpose of causal inference is to detect adverse drug event signals. The Principle of the Common Cause is exploited as a heuristic for a defeasible practical logic. The fundamental intuition is that improbable co-occurrences can be “explained away” with reference to a common cause, or confounder. Semantic constraints in literature-based discovery can be leveraged to identify such covariates. Further, the asymmetric semantic constraints of causal propositions map directly to the topology of causal graphs as directed edges. The hypothesis is that causal models conditioned on sets of such covariates will improve upon the performance of purely statistical techniques for detecting adverse drug event signals. By improving upon previous work in purely EHR-based pharmacovigilance, these results establish the utility of this scalable approach to automated causal inference
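
    The intuition that a co-occurrence can be "explained away" by a common cause can be illustrated with a small worked example. The sketch below is an assumption-laden illustration, not the dissertation's method: it uses the standard Mantel-Haenszel pooled odds ratio as a stand-in adjustment technique, and the counts are synthetic numbers made up for this example, constructed so that the drug and the event are independent within each level of a hypothetical confounder.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio of a 2x2 table.

    a = exposed with event, b = exposed without event,
    c = unexposed with event, d = unexposed without event.
    """
    return (a * d) / (b * c)


def collapse(strata):
    """Sum the 2x2 tables over all strata, ignoring the confounder."""
    return tuple(sum(col) for col in zip(*strata))


def mantel_haenszel_or(strata):
    """Pooled odds ratio adjusted for the stratifying variable."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den


# Synthetic counts (a, b, c, d), one table per level of the hypothetical confounder.
# In both strata the event rate is the same for exposed and unexposed patients.
SYNTHETIC_STRATA = [
    (40, 40, 10, 10),  # confounder present: event rate 0.5, mostly exposed
    (2, 18, 8, 72),    # confounder absent: event rate 0.1, mostly unexposed
]

crude = odds_ratio(*collapse(SYNTHETIC_STRATA))    # association ignoring the confounder
adjusted = mantel_haenszel_or(SYNTHETIC_STRATA)    # association after stratifying
```

    Here `crude` is well above 1 while `adjusted` equals 1: the apparent drug-event signal comes entirely from the confounder being more common among exposed patients.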

    Data-driven knowledge discovery in polycystic kidney disease

    The use of data derived from genomics and transcriptomics to further develop our understanding of Polycystic Kidney Diseases and identify novel drugs for their treatment. LUMC / Geneeskund

    Fast Machine Learning Algorithms for Massive Datasets with Applications in the Biomedical Domain

    The continuous increase in the size of datasets introduces computational challenges for machine learning algorithms. In this dissertation, we cover machine learning algorithms and applications in large-scale data analysis in manufacturing and healthcare. We begin by introducing a multilevel framework to scale the support vector machine (SVM), a popular supervised learning algorithm with a few tunable hyperparameters and highly accurate prediction. The computational complexity of nonlinear SVM is prohibitive on large-scale datasets compared to the linear SVM, which is more scalable for massive datasets. The nonlinear SVM has been shown to produce significantly higher classification quality on complex and highly imbalanced datasets. However, a higher classification quality requires a computationally expensive quadratic programming solver and extra kernel parameters for model selection. We introduce a generalized fast multilevel framework for regular, weighted, and instance-weighted SVM that achieves similar or better classification quality compared to state-of-the-art SVM libraries such as LIBSVM. Our framework improves the runtime by more than two orders of magnitude on some well-known benchmark datasets. We cover multiple versions of our proposed framework and its implementation in detail. The framework is implemented using the PETSc library, which allows easy integration with scientific computing tasks. Next, we propose an adaptive multilevel learning framework for SVM to reduce the variance between prediction qualities across the levels, improve the overall prediction accuracy, and improve the runtime. We implement multi-threaded support to speed up parameter fitting, which results in more than an order of magnitude speed-up. We design an early stopping criterion to avoid extra computational cost once the expected prediction quality is achieved. This approach provides significant speed-up, especially for massive datasets.
Finally, we propose an efficient low-dimensional feature extraction approach over massive knowledge networks. Knowledge networks are becoming more popular in the biomedical domain for knowledge representation. Each layer in a knowledge network can store information from one or multiple sources of data. The relationships between concepts or between layers represent valuable information. The proposed feature engineering approach provides an efficient and highly accurate prediction of the relationship between biomedical concepts on massive datasets. Our proposed approach utilizes semantics and probabilities to reduce the potential search space for the exploration and learning of machine learning algorithms. The calculation of probabilities is highly scalable with the size of the knowledge network. The number of features is fixed and equivalent to the number of relationships or classes in the data. A comprehensive comparison of well-known classifiers such as random forest, SVM, and deep learning over various features extracted from the same dataset provides an overview of performance and computational trade-offs. Our source code, documentation and parameters will be available at https://github.com/esadr/

    Reproducibility and Replicability in Unmanned Aircraft Systems and Geographic Information Science

    Multiple scientific disciplines face a so-called crisis of reproducibility and replicability (R&R) in which the validity of methodologies is questioned due to an inability to confirm experimental results. Trust in information technology (IT)-intensive workflows within geographic information science (GIScience), remote sensing, and photogrammetry depends on solutions to R&R challenges affecting multiple computationally driven disciplines. To date, there have only been very limited efforts to overcome R&R-related issues in remote sensing workflows in general, let alone those tied to disruptive technologies such as unmanned aircraft systems (UAS) and machine learning (ML). To accelerate an understanding of this crisis, a review was conducted to identify the issues preventing R&R in GIScience. Key barriers included: (1) awareness of time and resource requirements, (2) accessibility of provenance, metadata, and version control, (3) conceptualization of geographic problems, and (4) geographic variability between study areas. As a case study, a replication of a GIScience workflow utilizing YOLOv3 algorithms to identify objects in UAS imagery was attempted. Despite access to the source data and workflow steps, the lack of access to provenance and metadata for each small step of the work prevented successful replication. Finally, a novel method for provenance generation was proposed to address these issues. It was found that artificial intelligence (AI) could be used to quickly create robust provenance records for workflows that do not exceed time and resource constraints and provide the information needed to replicate work. Such information can bolster trust in scientific results and provide access to cutting edge technology that can improve everyday life
