Systems approaches to drug repositioning
PhD Thesis
Drug discovery has become less fruitful and more costly overall, despite vastly increased biomedical knowledge and evolving approaches to Research and Development (R&D).
One complementary approach to drug discovery is that of drug repositioning which
focusses on identifying novel uses for existing drugs. By focussing on existing drugs
that have already reached the market, drug repositioning has the potential to
reduce both the timeframe and the cost of getting a disease treatment to those who need it.
Many marketed examples of repositioned drugs have been found via serendipitous or
rational observations, highlighting the need for more systematic methodologies.
Systems approaches have the potential to enable the development of novel methods to
understand the action of therapeutic compounds, but require an integrative approach
to biological data. Integrated networks can facilitate systems-level analyses by combining
multiple sources of evidence to provide a rich description of drugs, their targets and
their interactions. Classically, such networks can be mined manually where a skilled
person can identify portions of the graph that are indicative of relationships between
drugs and highlight possible repositioning opportunities. However, this approach is
not scalable. Automated procedures are required to mine integrated networks systematically
for these subgraphs and bring them to the attention of the user. The aim
of this project was the development of novel computational methods to identify new
therapeutic uses for existing drugs (with particular focus on active small molecules)
using data integration.
A framework for integrating disparate data relevant to drug repositioning, the Drug Repositioning Network Integration Framework (DReNInF), was developed as part of this
work. This framework includes a high-level ontology, Drug Repositioning Network
Integration Ontology (DReNInO), to aid integration and subsequent mining; a suite
of parsers; and a generic semantic graph integration platform. This framework enables
the production of integrated networks maintaining strict semantics that are important
in, but not exclusive to, drug repositioning. The DReNInF is then used to create Drug Repositioning Network Integration (DReNIn), a semantically-rich Resource Description
Framework (RDF) dataset. A Web-based front end was developed, which includes
a SPARQL Protocol and RDF Query Language (SPARQL) endpoint for querying this
dataset.
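DReNIn exposes its integrated data as RDF triples queried through SPARQL. As a rough illustration of what a SPARQL basic graph pattern does, the sketch below matches a single triple pattern against a toy in-memory triple set; the `ex:` terms are invented for illustration and are not DReNIn's actual vocabulary.

```python
# Minimal sketch of SPARQL-style triple-pattern matching over an RDF-like
# dataset. The triples and predicate names are hypothetical.

TRIPLES = {
    ("ex:aspirin", "ex:targets", "ex:PTGS2"),
    ("ex:celecoxib", "ex:targets", "ex:PTGS2"),
    ("ex:PTGS2", "ex:associatedWith", "ex:inflammation"),
}

def match(pattern, triples):
    """Yield variable bindings for a single triple pattern.

    Pattern terms starting with '?' are variables; other terms
    must match the triple exactly.
    """
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                binding[term] = value
            elif term != value:
                break
        else:
            yield binding

# Which drugs target PTGS2?
# (analogous to: SELECT ?drug WHERE { ?drug ex:targets ex:PTGS2 })
drugs = sorted(b["?drug"] for b in match(("?drug", "ex:targets", "ex:PTGS2"), TRIPLES))
print(drugs)  # → ['ex:aspirin', 'ex:celecoxib']
```

A SPARQL engine evaluates conjunctions of many such patterns and joins the bindings, but the per-pattern matching is essentially this.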
To automate the mining of drug repositioning datasets, a formal framework for the
definition of semantic subgraphs was established and a method for Drug Repositioning
Semantic Mining (DReSMin) was developed. DReSMin is an algorithm for mining
semantically-rich networks for occurrences of a given semantic subgraph. This algorithm
allows instances of complex semantic subgraphs that contain data about putative
drug repositioning opportunities to be identified in a computationally tractable
fashion, scaling close to linearly with network data.
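The abstract describes DReSMin as matching a given semantic (typed) subgraph against an integrated network. The sketch below conveys the idea on a toy network with one hard-coded Drug→Protein→Disease pattern; DReSMin itself handles arbitrary patterns and is engineered to scale, and all node names here are invented.

```python
# Illustrative sketch (not DReSMin itself) of mining a typed network for
# occurrences of a semantic subgraph: the 3-node pattern
# Drug -[targets]-> Protein -[associated_with]-> Disease.

NODE_TYPES = {
    "sildenafil": "Drug", "PDE5": "Protein",
    "pulmonary_hypertension": "Disease", "angina": "Disease",
}
EDGES = {
    ("sildenafil", "targets", "PDE5"),
    ("PDE5", "associated_with", "pulmonary_hypertension"),
    ("PDE5", "associated_with", "angina"),
}

def find_pattern(node_types, edges):
    """Return (drug, protein, disease) instances of the semantic subgraph."""
    hits = []
    for drug, rel1, protein in edges:
        if rel1 != "targets" or node_types.get(drug) != "Drug":
            continue
        for src, rel2, disease in edges:
            if (src == protein and rel2 == "associated_with"
                    and node_types.get(disease) == "Disease"):
                hits.append((drug, protein, disease))
    return sorted(hits)

print(find_pattern(NODE_TYPES, EDGES))
```

Each returned instance is a candidate repositioning lead: a drug linked, via a shared target, to a disease it is not yet indicated for.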
The ability of DReSMin to identify novel Drug-Target (D-T) associations was investigated.
9,643,061 putative D-T interactions were identified and ranked, and a strong
correlation was observed between highly scored associations and those supported by the literature.
The 20 top-ranked associations were analysed in more detail, with 14 found
to be novel and six supported by the literature. It was also shown that
this approach prioritises known D-T interactions better than other state-of-the-art
methodologies.
The ability of DReSMin to identify novel Drug-Disease (Dr-D) indications was also
investigated. As target-based approaches are utilised heavily in the field of drug discovery,
it is necessary to have a systematic method to rank Gene-Disease (G-D) associations.
Although methods already exist to collect, integrate and score these associations,
these scores are often not a reliable reflection of expert knowledge. Therefore, an
integrated data-driven approach to drug repositioning was developed using a Bayesian
statistics approach and applied to rank 309,885 G-D associations using existing knowledge.
Ranked associations were then integrated with other biological data to produce
a semantically-rich drug discovery network. Using this network it was shown that
diseases of the central nervous system (CNS) provide an area of interest. The network
was then systematically mined for semantic subgraphs that capture novel Dr-D relations.
275,934 Dr-D associations were identified and ranked, with those more likely to
be side-effects filtered.
Work presented here includes novel tools and algorithms to enable research within
the field of drug repositioning. DReNIn, for example, includes data that previous
comparable datasets relevant to drug repositioning have neglected, such as clinical
trial data and drug indications. Furthermore, the dataset may be easily extended
using DReNInF to include future data as and when they become available, such as G-D
association directionality (i.e. whether a mutation is loss-of-function or gain-of-function).
Unlike other algorithms and approaches developed for drug repositioning, DReSMin
can be used to infer any types of associations captured in the target semantic network.
Moreover, the approaches presented here should be more generically applicable to
other fields that require algorithms for the integration and mining of semantically rich
networks.
Engineering and Physical Sciences Research Council (EPSRC) and GS
Ontology-based Semantic Harmonization of HIV-associated Common Data Elements for Integration of Diverse HIV Research Datasets
Analysis of integrated, diverse, Human Immunodeficiency Virus (HIV)-associated datasets can increase knowledge and guide the development of novel and effective interventions for disease prevention and treatment by increasing breadth of variables and statistical power, particularly for sub-group analyses. This topic has been identified as a National Institutes of Health research priority, but few efforts have been made to integrate data across HIV studies. Our aims were to: 1) Characterize the semantic heterogeneity (SH) in the HIV research domain; 2) Identify HIV-associated common data elements (CDEs) in empirically generated and knowledge-based resources; 3) Create a formal representation of HIV-associated CDEs in the form of an HIV-associated Entities in Research Ontology (HERO); 4) Assess the feasibility of using HERO to semantically harmonize HIV research data. Our approach was guided by information/knowledge theory and the DIKW (Data Information Knowledge Wisdom) hierarchical model.
Our systematized review of the literature revealed that the synergistic uses of ontologies and CDEs included integration, interoperability, data exchange, and data standardization. Moreover, methods and tools included use of experts for CDE identification, the Unified Medical Language System, natural language processing, Extensible Markup Language, Health Level 7, and ontology development tools (e.g., Protégé). Additionally, evaluation methods included expert assessment, quantification of mapping tasks between raters, assessment of interrater reliability, and comparison to established standards. We used these findings to inform our process for achieving the study aims.
For Aim 1, we analyzed eight disparate HIV-associated data dictionaries and developed a String Metric-assisted Assessment of Semantic Heterogeneity (SMASH) method, which aided identification of 127 (13%) homogeneous data element (DE) pairs and 1,048 (87%) semantically heterogeneous DE pairs. Most heterogeneous pairs (97%) were semantically-equivalent/syntactically-different, allowing us to determine that SH in the HIV research domain was high.
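SMASH pairs data elements (DEs) and uses string metrics to help judge semantic heterogeneity. The following sketch shows the general flavour using Python's built-in `difflib` similarity; the metric, the 0.85 threshold, and the DE names are assumptions for illustration, not the published SMASH procedure.

```python
# Hypothetical illustration of string-metric-assisted comparison of data
# element (DE) names; the metric and threshold are assumptions.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Normalized edit-based similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def classify_pair(a: str, b: str, threshold: float = 0.85) -> str:
    """Flag a DE pair as (syntactically) homogeneous or heterogeneous."""
    return "homogeneous" if similarity(a, b) >= threshold else "heterogeneous"

pairs = [
    ("date_of_birth", "date_of_birth"),  # identical names
    ("cd4_count", "cd4_cell_count"),     # related but differently named
]
for a, b in pairs:
    print(a, b, classify_pair(a, b))
```

Pairs flagged heterogeneous by the string metric would still need expert review to decide whether they are semantically equivalent despite the syntactic difference, which is where most of the study's heterogeneity was found.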
To achieve Aim 2, we used Clinicaltrials.gov, Google Search, and text mining in R to identify HIV-associated CDEs in HIV journal articles, HIV-associated datasets, AIDSinfo HIV/AIDS Glossary, AIDSinfo Drug Database, Logical Observation Identifiers Names and Codes (LOINC), Systematized Nomenclature of Medicine (SNOMED), and RxNorm. Two HIV experts then manually reviewed DEs from the journal articles and data dictionaries to confirm DE commonality and resolved semantic discrepancies through discussion. Ultimately, we identified 2,179 unique CDEs. Of all CDEs, data-driven approaches identified 2,055 (94%) (999 from the HIV/AIDS Glossary, 398 from the Drug Database, 91 from journal articles, and a total of 567 from LOINC, SNOMED, and RxNorm cumulatively). Expert-based approaches identified 124 (6%) unique CDEs from data dictionaries and confirmed the 91 CDEs from journal articles.
In Aim 3, we used the Protégé suite of ontology development tools and the 2,179 CDEs to develop the HERO. We modeled the ontology using the semantic structure of the Medical Entities Dictionary, available hierarchical information from the CDE knowledge resources, and expert knowledge. The ontology fulfilled most relevant criteria from Cimino’s desiderata and OntoClean ontology engineering principles, and it successfully answered eight competency questions.
Finally, for Aim 4, we assessed the feasibility of using HERO to semantically harmonize and integrate the data dictionaries from two diverse HIV-associated datasets. Two HIV experts involved in the development of HERO independently assessed each data dictionary. Of the 367 DEs in data dictionary 1 (D1), 181 (49.32%) were identified as CDEs and 186 (50.68%) were not CDEs, and of the 72 DEs in data dictionary 2 (D2), 37 (51.39%) were CDEs and 35 (48.61%) were not CDEs. The HIV experts then traversed HERO’s hierarchy to map CDEs from D1 and D2 to CDEs in HERO. Of the 181 CDEs in D1, 156 (86.19%) were found in HERO, and 25 (13.81%) were not. Similarly, of the 37 CDEs in D2, 32 (86.48%) were found in HERO, and 5 (13.51%) were not. Interrater reliability for CDE identification as measured by Cohen’s Kappa was 0.900 for D1 and 0.892 for D2. Cohen’s Kappas for CDEs in D1 and D2 that were also identified in HERO were 0.885 and 0.688, respectively.
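The interrater reliability figures above are Cohen's Kappa values. For reference, the standard two-rater kappa (observed agreement corrected for chance agreement) can be computed as below; the rating vectors are invented, not the study's data.

```python
# Standard two-rater Cohen's kappa; the rating vectors are illustrative
# only, not the study's actual CDE judgements.
from collections import Counter

def cohens_kappa(rater1, rater2):
    assert len(rater1) == len(rater2)
    n = len(rater1)
    # observed proportion of agreement
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    # chance agreement from each rater's marginal label frequencies
    c1, c2 = Counter(rater1), Counter(rater2)
    expected = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / (n * n)
    return (observed - expected) / (1 - expected)

r1 = ["CDE", "CDE", "not", "CDE", "not", "CDE", "not", "not"]
r2 = ["CDE", "CDE", "not", "CDE", "CDE", "CDE", "not", "not"]
print(round(cohens_kappa(r1, r2), 3))  # → 0.75
```

Kappa near 0.9, as reported for D1 and D2, indicates almost perfect agreement under the usual Landis-Koch interpretation.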
Subsequently, to demonstrate the integration of the two HIV-associated datasets, a sample of semantically harmonized CDEs in both datasets was categorically selected (e.g. administrative, demographic, and behavioral), and D2 sample size increases were calculated for race (e.g., White, African American/Black, Asian/Pacific Islander, Native American/Indian, and Hispanic/Latino) and for “intravenous drug use” from the integrated datasets. The average increase of D2 CDEs for six selected CDEs was 1,928%.
Despite the limitation of HERO developers also serving as evaluators, the contributions of the study to the fields of informatics and HIV research were substantial. Confirmatory contributions include: identification of effective CDE/ontology tools, and use of data-driven and expert-based methods. Novel contributions include: development of SMASH and HERO; and new contributions include documenting that SH is high in HIV-associated datasets, identifying 2,179 HIV-associated CDEs, creating two additional classifications of SH, and showing that using HERO for semantic harmonization of HIV-associated data dictionaries is feasible. Our future work will build upon this research by expanding the numbers and types of datasets, refining our methods and tools, and conducting an external evaluation
Leveraging Knowledge-Based Approaches to Promote Antiretroviral Toxicity Monitoring in Underserved Settings
As access and use of antiretroviral therapy continue to increase, the need to improve antiretroviral toxicity monitoring becomes more critical. This is particularly so in underserved settings, where patterns of antiretroviral toxicities possibly alter the need for and frequency of antiretroviral toxicity monitoring. However, barriers such as few skilled healthcare providers and poor infrastructure make antiretroviral toxicity monitoring in underserved settings difficult. The purpose of this dissertation was to investigate how standard clinical guidelines, knowledge-based clinical decision support, and task delegation could be leveraged to overcome barriers to antiretroviral toxicity monitoring in underserved settings.
The strategy adopted in this dissertation was guided by the Design Science Research Methodology that emphasizes the generation of scientific knowledge through building novel artifacts. Two qualitative descriptive studies were conducted to characterize the contextual factors associated with antiretroviral toxicity monitoring in underserved settings. Supported by the findings from these studies, a knowledge-based software application prototype that implements clinical practice guidelines for antiretroviral toxicity monitoring was developed. Next, a quantitative validation study was used to evaluate the structure and behavior of the prototype’s knowledge base. Lastly, a quantitative usability study was conducted to assess lay health worker perceptions of the satisfaction and mental effort associated with the use of checklists generated by the prototype.
This dissertation research produced empirical evidence about the broad motives and strategies for promoting medication adherence, safety, and effectiveness in underserved settings. It also identified barriers and facilitators of antiretroviral toxicity monitoring within ambulatory HIV care workflows in underserved settings. Additionally, it provided evidence about the extent to which antiretroviral toxicity domain knowledge could be implemented in a knowledge-based application for supporting point-of-care antiretroviral toxicity monitoring. Lastly, the research provided previously unavailable empirical evidence about the perceptions of lay peer health workers on the use of checklists for the documentation of antiretroviral toxicities
OM-2017: Proceedings of the Twelfth International Workshop on Ontology Matching
Ontology matching is a key interoperability enabler for the semantic web, as well as a useful tactic in some classical data integration tasks dealing with the semantic heterogeneity problem. It takes ontologies as input and determines as output an alignment, that is, a set of correspondences between the semantically related entities of those ontologies. These correspondences can be used for various tasks, such as ontology merging, data translation, query answering or navigation on the web of data. Thus, matching ontologies enables the knowledge and data expressed with the matched ontologies to interoperate.
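As a minimal illustration of an alignment as a set of correspondences, the toy matcher below compares class labels from two hypothetical ontologies with a string similarity and keeps pairs above a threshold. Real matchers combine lexical, structural, and semantic evidence; every label and the threshold here are assumptions.

```python
# Toy label-based matcher producing an alignment (a set of
# correspondences) between two hypothetical ontologies.
from difflib import SequenceMatcher

ONTO_A = {"a:Patient": "patient", "a:Drug": "drug", "a:BodyPart": "body part"}
ONTO_B = {"b:Subject": "patient", "b:Medication": "drug", "b:Organ": "organ"}

def align(onto_a, onto_b, threshold=0.9):
    """Return correspondences (entity_a, entity_b, score) above threshold."""
    alignment = []
    for entity_a, label_a in onto_a.items():
        for entity_b, label_b in onto_b.items():
            score = SequenceMatcher(None, label_a, label_b).ratio()
            if score >= threshold:
                alignment.append((entity_a, entity_b, round(score, 2)))
    return sorted(alignment)

print(align(ONTO_A, ONTO_B))
# → [('a:Drug', 'b:Medication', 1.0), ('a:Patient', 'b:Subject', 1.0)]
```

Note that `a:BodyPart` finds no partner: label similarity alone misses correspondences that need background knowledge, which is exactly the gap workshop systems compete on.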
Towards a system of concepts for Family Medicine. Multilingual indexing in General Practice/ Family Medicine in the era of Semantic Web
UNIVERSITY OF LIÈGE, BELGIUM
Executive Summary
Faculty of Medicine
Département Universitaire de Médecine Générale.
Unité de recherche Soins Primaires et Santé
Doctor in biomedical sciences
Towards a system of concepts for Family Medicine.
Multilingual indexing in General Practice/ Family Medicine in the era
of Semantic Web
by Dr. Marc JAMOULLE
Introduction
This thesis is about giving visibility to the often overlooked work of family
physicians and consequently, is about grey literature in General Practice
and Family Medicine (GP/FM). It often seems that conference organizers
do not think of GP/FM as a knowledge-producing discipline that deserves
active dissemination. A conference is organized, but not much is done with
the knowledge shared at these meetings. In turn, the knowledge cannot be
reused or reapplied. This thesis is also about indexing. To retrieve
knowledge, indexing is mandatory. We must prepare tools that will automatically
index the thousands of abstracts that family doctors produce each year in
various languages. And finally this work is about semantics1. It is an introduction
to health terminologies, ontologies, semantic data, and linked
open data. All are expressions of the next step: Semantic Web for health
care data. Concepts, units of thought expressed by terms, will be our target
and must have the ability to be expressed in multiple languages. In turn,
three areas of knowledge are at stake in this study: (i) Family Medicine as a
pillar of primary health care, (ii) computational linguistics, and (iii) health
information systems.
Aim
• To identify knowledge produced by general practitioners (GPs) by
improving annotation of grey literature in Primary Health Care
• To propose an experimental indexing system, acting as a draft for a
standardized table of contents for GP/FM
• To improve the searchability of repositories for grey literature in GP/FM.
1 For specific terms, see the Glossary, page 257.
Methods
The first step aimed to design the taxonomy by identifying relevant concepts
in a compiled corpus of GP/FM texts. We have studied the concepts
identified in nearly two thousand conference communications by GPs. The
relevant concepts belong to fields that focus on GP/FM activities
(e.g. teaching, ethics, management or environmental hazard issues).
The second step was the development of an on-line, multilingual, terminological
resource for each category of the resulting taxonomy, named
Q-Codes. We have designed this terminology in the form of a lightweight
ontology, accessible on-line for readers and ready for use by computers of
the semantic web. It is also fit for the Linked Open Data universe.
Results
We propose 182 Q-Codes in an on-line multilingual database (10 languages)
(www.hetop.eu/Q), each acting as a filter for Medline. Q-Codes are also available
in the form of Uniform Resource Identifiers (URIs) and are exportable
in Web Ontology Language (OWL). The International Classification of Primary
Care (ICPC) is linked to Q-Codes in order to form the Core Content
Classification in General Practice/Family Medicine (3CGP). So far, 3CGP is
in use by humans in pedagogy, in bibliographic studies, and in indexing congress
abstracts, master's theses and other forms of grey literature in GP/FM. Its use by
computers is being tested in automatic classifiers, annotators and natural
language processing.
Discussion
To the best of our knowledge, this is the first attempt to expand the ICPC
coding system with an extension for family physician contextual issues,
thus covering non-clinical content of practice. It remains to be proven that
our proposed terminology will help in dealing with more complex systems,
such as MeSH, to support information storage and retrieval activities.
However, this exercise is proposed as a first step in the creation of an ontology
of GP/FM and as an opening to the complex world of Semantic Web
technologies.
Conclusion
We expect that the creation of this terminological resource for indexing abstracts
and for facilitating Medline searches for general practitioners, researchers
and students in medicine will reduce loss of knowledge in the
domain of GP/FM. In addition, through better indexing of the grey literature
(congress abstracts, master’s and doctoral theses), we hope to enhance
the accessibility of research results and give visibility to the invisible work
of family physicians.
Integrative bioinformatics and graph-based methods for predicting adverse effects of developmental drugs
Adverse drug effects are complex phenomena that involve the interplay between drug molecules and their protein targets at various levels of biological organisation, from molecular to organismal. Many factors are known to contribute toward the safety profile of a drug, including the chemical properties of the drug molecule itself, the biological properties of drug targets and other proteins that are involved in pharmacodynamics and pharmacokinetics aspects of drug action, and the characteristics of the intended patient population. A multitude of scattered publicly available resources exist that cover these important aspects of drug activity. These include manually curated biological databases, high-throughput experimental results from gene expression and human genetics resources as well as drug labels and registered clinical trial records. This thesis proposes an integrated analysis of these disparate sources of information to help bridge the gap between the molecular and the clinical aspects of drug action. For example, to address the commonly held assumption that narrowly expressed proteins make safer drug targets, an integrative data-driven analysis was conducted to systematically investigate the relationship between the tissue expression profile of drug targets and the organs affected by clinically observed adverse drug reactions. Similarly, human genetics data were used extensively throughout the thesis to compare adverse symptoms induced by drug molecules with the phenotypes associated with the genes encoding their target proteins. One of the main outcomes of this thesis was the generation of a large knowledge graph, which incorporates diverse molecular and phenotypic data in a structured network format. To leverage the integrated information, two graph-based machine learning methods were developed to predict a wide range of adverse drug effects caused by approved and developmental therapies
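As a much-simplified sketch of the guilt-by-association reasoning such a knowledge graph enables (not the thesis's actual graph machine-learning methods), the code below scores candidate adverse effects for a drug from drugs that share its protein targets; all nodes and edges are invented.

```python
# Minimal guilt-by-association sketch over a toy drug/target/effect graph:
# a candidate drug inherits an adverse-effect score from drugs that share
# its protein targets (Jaccard overlap of target sets).

DRUG_TARGETS = {
    "drug_A": {"HERG", "COX1"},
    "drug_B": {"HERG", "COX1", "PDE5"},
    "drug_C": {"OPRM1"},
}
KNOWN_EFFECTS = {"drug_A": {"arrhythmia"}, "drug_C": {"sedation"}}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def predict_effects(candidate, drug_targets, known_effects):
    """Score adverse effects for `candidate` from target-sharing neighbours."""
    scores = {}
    for other, effects in known_effects.items():
        if other == candidate:
            continue
        sim = jaccard(drug_targets[candidate], drug_targets[other])
        for effect in effects:
            scores[effect] = max(scores.get(effect, 0.0), sim)
    return scores

print(predict_effects("drug_B", DRUG_TARGETS, KNOWN_EFFECTS))
```

Here `drug_B` scores high for arrhythmia because it shares two targets with `drug_A`; the graph methods in the thesis generalise this intuition across many node and edge types at once.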
Word-sense disambiguation in biomedical ontologies
With the ever-increasing volume of biomedical literature, text-mining has emerged as an important technology to support bio-curation and search. Word sense disambiguation (WSD), the correct identification of a term's sense in text in the light of ambiguity, is an important problem in text-mining. Since the late 1940s, many approaches based on supervised (decision trees, naive Bayes, neural networks, support vector machines) and unsupervised machine learning (context clustering, word clustering, co-occurrence graphs) have been developed. Knowledge-based methods that make use of the WordNet computational lexicon have also been developed. But only a few make use of ontologies, i.e. hierarchical controlled vocabularies, to solve the problem, and none exploit inference over ontologies and the use of metadata from publications.
This thesis addresses the WSD problem in biomedical ontologies by suggesting different approaches for word sense disambiguation that use ontologies and metadata. The "Closest Sense" method assumes that the ontology defines multiple senses of the term; it computes the shortest path of co-occurring terms in the document to one of these senses. The "Term Cooc" method defines a log-odds ratio for co-occurring terms including inferred co-occurrences. The "MetaData" approach trains a classifier on metadata; it does not require any ontology, but requires training data, which the other methods do not. These approaches are compared to each other when applied to a manually curated training corpus of 2600 documents for seven ambiguous terms from the Gene Ontology and MeSH. All approaches over all conditions achieve 80% success rate on average. The MetaData approach performs best with 96%, when trained on high-quality data. Its performance deteriorates as quality of the training data decreases. The Term Cooc approach performs better on Gene Ontology (92% success) than on MeSH (73% success) as MeSH is not a strict is-a/part-of, but rather a loose is-related-to hierarchy. The Closest Sense approach achieves on average 80% success rate.
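The "Term Cooc" method is described as a log-odds ratio over co-occurring terms. The sketch below shows one way such a score could drive disambiguation, with invented co-occurrence counts for two senses of "nucleus"; the published method additionally exploits co-occurrences inferred over the ontology hierarchy, which this toy omits.

```python
# Hedged sketch of log-odds disambiguation: pick the sense whose
# co-occurrence profile best explains the context words. All counts
# are invented for illustration.
import math

# co-occurrence counts of context words with each sense of "nucleus"
COOC = {
    "cell_nucleus":  {"chromosome": 30, "membrane": 20, "neuron": 2},
    "brain_nucleus": {"chromosome": 1,  "membrane": 3,  "neuron": 40},
}

def sense_score(sense, context, smoothing=1.0):
    counts = COOC[sense]
    total = sum(counts.values())
    score = 0.0
    for word in context:
        # smoothed probability of the context word under this sense
        p = (counts.get(word, 0) + smoothing) / (total + smoothing * len(counts))
        score += math.log(p / (1 - p))  # log-odds contribution
    return score

def disambiguate(context):
    return max(COOC, key=lambda sense: sense_score(sense, context))

print(disambiguate(["neuron", "membrane"]))  # → brain_nucleus
```

A context dominated by "neuron" pulls the decision towards the anatomical sense, while "chromosome"-heavy contexts favour the cell-biology sense.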
Furthermore, the thesis showcases applications ranging from ontology design to semantic search where WSD is important.
Proceedings of The Tenth International Workshop on Ontology Matching (OM-2015)
No abstract available.
COMPLEX QUESTION ANSWERING BASED ON A SEMANTIC DOMAIN MODEL OF CLINICAL MEDICINE
Much research in recent years has focused on question answering. Due to significant advances in answering simple fact-seeking questions, research is moving towards resolving complex questions. An approach adopted by many researchers is to decompose a complex question into a series of fact-seeking questions and reuse techniques developed for answering simple questions. This thesis presents an alternative novel approach to domain-specific complex question answering based on consistently applying a semantic domain model to question and document understanding as well as to answer extraction and generation. This study uses a semantic domain model of clinical medicine to encode (a) a clinician's information need expressed as a question on the one hand and (b) the meaning of scientific publications on the other to yield a common representation. It is hypothesized that this approach will work well for (1) finding documents that contain answers to clinical questions and (2) extracting these answers from the documents. The domain of clinical question answering was selected primarily because of its unparalleled resources that permit providing a proof by construction for this hypothesis. In addition, a working prototype of a clinical question answering system will support research in informed clinical decision making. The proposed methodology is based on the semantic domain model developed within the paradigm of Evidence Based Medicine. Three basic components of this model - the clinical task, a framework for capturing a synopsis of a clinical scenario that generated the question, and strength of evidence presented in an answer - are identified and discussed in detail. Algorithms and methods were developed that combine knowledge-based and statistical techniques to extract the basic components of the domain model from abstracts of biomedical articles. 
These algorithms serve as a foundation for the prototype end-to-end clinical question answering system that was built and evaluated to test the hypotheses. Evaluation of the system on test collections developed in the course of this work and based on real life clinical questions demonstrates feasibility of complex question answering and high accuracy information retrieval using a semantic domain model