
    Enrichment of ontologies using machine learning and summarization

    Biomedical ontologies are structured knowledge systems in biomedicine. They play a major role in enabling precise communication in support of healthcare applications, e.g., Electronic Healthcare Records (EHR) systems. Biomedical ontologies are used in many different contexts to facilitate information and knowledge management. The most widely used clinical ontology is SNOMED CT. Placing a new concept into its proper position in an ontology is a fundamental task in its lifecycle of curation and enrichment. A large biomedical ontology, which typically consists of many tens of thousands of concepts and relationships, can be viewed as a complex network with concepts as nodes and relationships as links. This large node-link diagram can easily become overwhelming for humans to understand or work with. Adding concepts is a challenging and time-consuming task that requires domain knowledge and ontology skills. IS-A links (aka subclass links) are the most important relationships of an ontology, enabling the inheritance of other relationships. The position of a concept, represented by its IS-A links to other concepts, determines how accurately it is modeled. Therefore, considering as many parent candidate concepts as possible leads to better modeling of this concept. Traditionally, curators rely on classifiers to place concepts into ontologies. However, this assumes accurate relationship modeling of the new concept as well as of the existing concepts. Since many concepts in existing ontologies are underspecified in terms of their relationships, the placement by classifiers may be wrong. In cases where the curator does not manually check the automatic placement by classifier programs, concepts may end up in wrong positions in the IS-A hierarchy. A user searching for a concept without knowing its precise name would then not find it in its expected location.
Automated or semi-automated techniques that can place a concept, or narrow down the places where it could be inserted, are highly desirable. Hence, this dissertation addresses the problem of concept placement by automatically identifying IS-A links and potential parent concepts correctly and effectively for new concepts, with the assistance of two powerful techniques, Machine Learning (ML) and Abstraction Networks (AbNs). Modern neural networks have revolutionized Machine Learning in vision and Natural Language Processing (NLP). They also show great promise for ontology-related tasks, including ontology enrichment, i.e., insertion of new concepts. This dissertation presents research using ML and AbNs to achieve knowledge enrichment of ontologies. AbNs are compact summary networks that preserve a significant amount of the semantics and structure of the underlying ontologies. An Abstraction Network is automatically derived from the ontology itself. It consists of nodes, where each node represents a set of concepts that are similar in their structure and semantics. Various kinds of AbNs have been previously developed by the Structural Analysis of Biomedical Ontologies Center (SABOC) to support the summarization, visualization, and quality assurance (QA) of biomedical ontologies. Two basic kinds of AbNs are the Area Taxonomy and the Partial-area Taxonomy, which have been developed for various biomedical ontologies (e.g., SNOMED CT of SNOMED International and NCIt of the National Cancer Institute). This dissertation presents four enrichment studies of SNOMED CT, utilizing both ML and AbN-based techniques.

    Scalable Approaches for Auditing the Completeness of Biomedical Ontologies

    An ontology provides a formalized representation of knowledge within a domain. In biomedicine, ontologies have been widely used in modern biomedical applications to enable semantic interoperability and facilitate data exchange. Given the important roles that biomedical ontologies play, quality issues such as incompleteness, if not addressed, can affect the quality of downstream ontology-driven applications. However, biomedical ontologies often have large sizes and complex structures. Thus, it is infeasible to uncover potential quality issues through manual effort. In this dissertation, we introduce automated and scalable approaches for auditing the completeness of biomedical ontologies. We mainly focus on two incompleteness issues: missing hierarchical relations and missing concepts. To identify missing hierarchical relations, we develop three approaches: a lexical-based approach, a hybrid approach utilizing both lexical features and logical definitions, and an approach based on concept name transformation. To identify missing concepts, a lexical-based Formal Concept Analysis (FCA) method is proposed for concept enrichment. We also predict proper concept names for the missing concepts using deep learning techniques. Manual review by domain experts is performed to evaluate these approaches. In addition, we leverage extrinsic knowledge (i.e., external ontologies) to help validate the detected incompleteness issues. The auditing approaches have been applied to a variety of biomedical ontologies, including SNOMED CT, the National Cancer Institute (NCI) Thesaurus, and the Gene Ontology. In the first lexical-based approach to identify missing hierarchical relations, each concept is modeled with an enriched set of lexical features, leveraging words and noun phrases in the name of the concept itself and the concept's ancestors.
Given a pair of concepts that are not linked by a hierarchical relation, if the enriched lexical attributes of one concept are a superset of the other's, a potentially missing hierarchical relation will be suggested. Applying this approach to the September 2017 release of SNOMED CT (US edition) suggested 38,615 potentially missing hierarchical relations. A domain expert reviewed a random sample of 100 of them and confirmed that 90 are valid (a precision of 90%). In the second work, a hybrid approach is proposed to detect missing hierarchical relations in non-lattice subgraphs. For each concept, its lexical features are harmonized with role definitions to provide a more comprehensive semantic model. Then two-step subsumption testing is performed to automatically suggest potentially missing hierarchical relations. This approach identified 55 potentially missing hierarchical relations in the 19.08d version of the NCI Thesaurus. Of these, 29 were confirmed as valid by the curators from the NCI Enterprise Vocabulary Service (EVS) and have been incorporated in the newer versions of the NCI Thesaurus, and 7 further revealed incorrect existing hierarchical relations in the NCI Thesaurus. In the third work, we introduce a transformation-based method that leverages the Unified Medical Language System (UMLS) knowledge to identify missing hierarchical relations in its source ontologies. Given a concept name, noun chunks within it are identified and replaced by their more general counterparts to generate new concept names that are supposed to be more general than the original one. Applying this method to the UMLS (2019AB release), a total of 39,359 potentially missing hierarchical relations were detected in 13 source ontologies. Domain experts evaluated a random sample of 200 potentially missing hierarchical relations identified in SNOMED CT (US edition), and 100 in the Gene Ontology.
173 out of 200 and 63 out of 100 potentially missing hierarchical relations were confirmed by domain experts, indicating that our method achieved precisions of 86.5% and 63% for SNOMED CT and the Gene Ontology, respectively. In the work on concept enrichment, we introduce a lexical method based on FCA to identify potentially missing concepts. Lexical features (i.e., words appearing in the concept names) are treated as FCA attributes when generating the formal context. Applying multistage intersection on FCA attributes results in newly formalized concepts along with bags of words that can be utilized to name the concepts. This method was applied to the Disease or Disorder sub-hierarchy in the 19.08d version of the NCI Thesaurus and identified 8,983 potentially missing concepts. We performed a preliminary evaluation and validated that 592 out of 8,983 potentially missing concepts were included in external ontologies in the UMLS. After obtaining new concepts and their relevant bags of words, we further developed deep learning-based approaches to automatically predict concept names that comply with the naming convention of a specific ontology. We explored a simple neural network, Long Short-Term Memory (LSTM), and a Convolutional Neural Network (CNN) combined with LSTM. Our experiments showed that the LSTM-based approach achieved the best performance with an F1 score of 63.41% for predicting names for newly added concepts in the March 2018 release of SNOMED CT (US Edition) and an F1 score of 73.95% for naming missing concepts revealed by our previous work. In the last part of this dissertation, extrinsic knowledge is leveraged to collect supporting evidence for the detected incompleteness issues. We present a work in which cross-ontology evaluation based on extrinsic knowledge from the UMLS is utilized to help validate potentially missing hierarchical relations, aiming at relieving the heavy workload of manual review.
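
The first lexical-based approach in this abstract (suggest a missing hierarchical relation when one concept's enriched lexical attributes are a superset of another's) might be sketched roughly as follows. This is only an illustration under stated assumptions: the toy hierarchy `PARENTS` and the function names are hypothetical, features are single words only (the abstract also mentions noun phrases), and a strict superset test is assumed.

```python
from itertools import permutations

# Hypothetical toy hierarchy: concept name -> set of direct parents.
# The is-a link from "chronic kidney disease" to "kidney disease" is
# deliberately left out so the sketch has something to find.
PARENTS = {
    "disease": set(),
    "kidney disease": {"disease"},
    "chronic kidney disease": {"disease"},
}


def ancestors(concept, parents=PARENTS):
    """Return all ancestors of a concept by following is-a links upward."""
    seen = set()
    stack = list(parents[concept])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return seen


def enriched_features(concept, parents=PARENTS):
    """Words in the concept's own name plus words in all ancestor names."""
    words = set(concept.split())
    for anc in ancestors(concept, parents):
        words |= set(anc.split())
    return words


def suggest_missing_isa(parents=PARENTS):
    """Suggest (child, parent) pairs that are unlinked but whose features nest."""
    suggestions = []
    for a, b in permutations(parents, 2):
        if b in ancestors(a, parents) or a in ancestors(b, parents):
            continue  # already hierarchically related, nothing to suggest
        # Strict superset (an assumption): a carries every feature of b and more.
        if enriched_features(a, parents) > enriched_features(b, parents):
            suggestions.append((a, b))
    return sorted(suggestions)
```

On the toy hierarchy, `suggest_missing_isa()` proposes that "chronic kidney disease" may be missing an is-a link to "kidney disease". The all-pairs loop is quadratic and would need indexing to scale to an ontology the size of SNOMED CT.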

    STRUCTURAL AND LEXICAL METHODS FOR AUDITING BIOMEDICAL TERMINOLOGIES

    Biomedical terminologies serve as knowledge sources for a wide variety of biomedical applications including information extraction and retrieval, data integration and management, and decision support. Quality issues of biomedical terminologies, if not addressed, could affect all downstream applications that use them as knowledge sources. Therefore, Terminology Quality Assurance (TQA) has become an integral part of the terminology management lifecycle. However, identification of potential quality issues is challenging due to the ever-growing size and complexity of biomedical terminologies. It is time-consuming and labor-intensive to manually audit them and hence, automated TQA methods are highly desirable. In this dissertation, systematic and scalable methods to audit biomedical terminologies utilizing their structural as well as lexical information are proposed. Two inference-based methods, two non-lattice-based methods and a deep learning-based method are developed to identify potentially missing hierarchical (or is-a) relations, erroneous is-a relations, and missing concepts in biomedical terminologies including the Gene Ontology (GO), the National Cancer Institute Thesaurus (NCIt), and SNOMED CT. In the first inference-based method, the GO concept names are represented using a set-of-words model and a sequence-of-words model, respectively. Inconsistencies derived between hierarchically linked and unlinked concept pairs are leveraged to detect potentially missing or erroneous is-a relations. The set-of-words model detects a total of 5,359 potential inconsistencies in the 03/28/2017 release of GO and the sequence-of-words model detects 4,959. Domain experts’ evaluation shows that the set-of-words model achieves a precision of 53.78% (128 out of 238) and the sequence-of-words model achieves a precision of 57.55% (122 out of 212) in identifying inconsistencies.
In the second inference-based method, a Subsumption-based Sub-term Inference Framework (SSIF) is developed by introducing a novel term-algebra on top of a sequence-based representation of GO concepts. The sequence-based representation utilizes the part of speech of concept names, sub-concepts (concept names appearing inside another concept name), and antonyms appearing in concept names. Three conditional rules (monotonicity, intersection, and sub-concept rules) are developed for backward subsumption inference. Applying SSIF to the 10/03/2018 release of GO suggests 1,938 potentially missing is-a relations. Domain experts’ evaluation of 210 randomly selected potentially missing is-a relations shows that SSIF achieves a precision of 60.61%, 60.49%, and 46.03% for the monotonicity, intersection, and sub-concept rules, respectively. In the first non-lattice-based method, lexical patterns of concepts in Non-Lattice Subgraphs (NLSs: graph fragments with a higher tendency to contain quality issues) are mined to detect potentially missing is-a relations and missing concepts in NCIt. Six lexical patterns are leveraged: containment, union, intersection, union-intersection, inference-contradiction, and inference-union. Each pattern indicates a potential specific type of error and suggests a potential type of remediation. This method identifies 809 NLSs exhibiting these patterns in the 16.12d version of NCIt, achieving a precision of 66% (33 out of 50). In the second non-lattice-based method, enriched lexical attributes from concept ancestors are leveraged to identify potentially missing is-a relations in NLSs. The lexical attributes of a concept are inherited in two ways: from ancestors within the NLS, and from all the ancestors. For a pair of concepts without a hierarchical relation, if the lexical attributes of one concept are a subset of those of the other, a potentially missing is-a relation between the two concepts is suggested.
This method identifies a total of 1,022 potentially missing is-a relations in the 19.01d release of NCIt with a precision of 84.44% (76 out of 90) for inheriting lexical attributes from ancestors within the NLS and 89.02% (73 out of 82) for inheriting from all the ancestors. For the non-lattice-based methods, similar NLSs may contain similar quality issues, and thus exhaustive examination of NLSs would involve redundant work. A hybrid method is introduced to identify similar NLSs to avoid redundant analyses. Given an input NLS, a graph isomorphism algorithm is used to obtain its structurally identical NLSs. A similarity score between the input NLS and each of its structurally identical NLSs is computed based on semantic similarity between their corresponding concept names. To compute the similarity between concept names, the concept names are converted to vectors using the Doc2Vec document embedding model and then the cosine similarity of the two vectors is computed. All the structurally identical NLSs with a similarity score above 0.85 are considered to be similar to the input NLS. Applying this method to 10 different structures of NLSs in the 02/12/2018 release of GO reveals that 38.43% of these NLSs have at least one similar NLS. Finally, a deep learning-based method is explored to facilitate the suggestion of missing is-a relations in NCIt and SNOMED CT. Concept pairs exhibiting a containment pattern are the focus here. The problem is framed as a binary classification task, where given a pair of concepts, the deep learning model learns to predict whether the two concepts have an is-a relation or not. Positive training samples are existing is-a relations in the terminology exhibiting the containment pattern. Negative training samples are concept pairs without is-a relations that also exhibit the containment pattern. A graph neural network model is constructed for this task and trained with subgraphs generated enclosing the pairs of concepts in the samples.
To evaluate the models trained on the two terminologies, two evaluation sets are created considering newer releases of each terminology as a partial reference standard. The model trained on NCIt achieves a precision of 0.5, a recall of 0.75, and an F1 score of 0.6. The model trained on SNOMED CT achieves a precision of 0.51, a recall of 0.64, and an F1 score of 0.56.
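
The NLS similarity step in this abstract (cosine similarity between concept-name vectors, with a 0.85 threshold) could look something like the sketch below. It rests on several assumptions: the vectors are plain lists of floats standing in for Doc2Vec embeddings, the per-name similarities are averaged into a single NLS score (the abstract does not say how they are combined), and all function names are hypothetical.

```python
import math

SIMILARITY_THRESHOLD = 0.85  # threshold value stated in the abstract


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def nls_similarity(input_vectors, candidate_vectors):
    """Average cosine similarity over corresponding concept-name vectors.

    Averaging is an assumption made for this sketch; the abstract only
    says the score is based on similarity between corresponding names.
    """
    scores = [cosine_similarity(u, v)
              for u, v in zip(input_vectors, candidate_vectors)]
    return sum(scores) / len(scores)


def similar_nlss(input_vectors, candidates):
    """Names of structurally identical NLSs scoring above the threshold.

    `candidates` maps an NLS name to its list of concept-name vectors,
    ordered to correspond with `input_vectors`.
    """
    return [name for name, vectors in candidates.items()
            if nls_similarity(input_vectors, vectors) > SIMILARITY_THRESHOLD]
```

In practice the vectors would come from a trained document embedding model, and the correspondence between concepts would come from the graph isomorphism mapping. Both are outside the scope of this sketch.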

    Natural Language Processing and Graph Representation Learning for Clinical Data

    The past decade has witnessed remarkable progress in biomedical informatics and its related fields: the development of high-throughput technologies in genomics, the mass adoption of electronic health records systems, and the AI renaissance largely catalyzed by deep learning. Deep learning has played an undeniably important role in our attempts to reduce the gap between the exponentially growing amount of biomedical data and our ability to make sense of them. In particular, the two main pillars of this dissertation, natural language processing and graph representation learning, have improved our capacity to learn useful representations of language and structured data to an extent previously considered unattainable in such a short time frame. In the context of clinical data, characterized by its notorious heterogeneity and complexity, natural language processing and graph representation learning have begun to enrich our toolkits for making sense and making use of the wealth of biomedical data beyond rule-based systems or traditional regression techniques. This dissertation comes at the cusp of such a paradigm shift, detailing my journey across the fields of biomedical and clinical informatics through the lens of natural language processing and graph representation learning. The takeaway is quite optimistic: despite the many layers of inefficiencies and challenges in the healthcare ecosystem, AI for healthcare is gearing up to transform the world in new and exciting ways.

    DeepOnto: A Python Package for Ontology Engineering with Deep Learning

    Applying deep learning techniques, particularly language models (LMs), in ontology engineering has raised widespread attention. However, deep learning frameworks like PyTorch and TensorFlow are predominantly developed for Python programming, while widely-used ontology APIs, such as the OWL API and Jena, are primarily Java-based. To facilitate seamless integration of these frameworks and APIs, we present DeepOnto, a Python package designed for ontology engineering. The package encompasses a core ontology processing module founded on the widely-recognised and reliable OWL API, encapsulating its fundamental features in a more "Pythonic" manner and extending its capabilities to include other essential components including reasoning, verbalisation, normalisation, projection, and more. Building on this module, DeepOnto offers a suite of tools, resources, and algorithms that support various ontology engineering tasks, such as ontology alignment and completion, by harnessing deep learning methodologies, primarily pre-trained LMs. In this paper, we also demonstrate the practical utility of DeepOnto through two use-cases: the Digital Health Coaching in Samsung Research UK and the Bio-ML track of the Ontology Alignment Evaluation Initiative (OAEI). Comment: under review at Semantic Web Journal

    Biomedical entities recognition in Spanish combining word embeddings

    Named Entity Recognition (NER) is an important task in the field of Natural Language Processing that is used to extract meaningful knowledge from textual documents. The goal of NER is to identify text fragments that refer to specific entities. In this thesis we aim to address the task of NER in the Spanish biomedical domain. In this domain, entities can refer to drug, symptom, and disease names and offer valuable knowledge to health experts. For this purpose, we propose a model based on neural networks and employ a combination of word embeddings. In addition, we generate new domain- and language-specific embeddings to test their effectiveness. Finally, we show that the combination of different word embeddings as input to the neural network improves the state-of-the-art results in the applied scenarios. Thesis, Univ. Jaén, Departamento de Informática. Defended on 22 April 2021.

    Front-Line Physicians' Satisfaction with Information Systems in Hospitals

    Day-to-day operations management in hospital units is difficult due to continuously varying situations, several actors involved and a vast number of information systems in use. The aim of this study was to describe front-line physicians' satisfaction with existing information systems needed to support the day-to-day operations management in hospitals. A cross-sectional survey was used and data chosen with stratified random sampling were collected in nine hospitals. Data were analyzed with descriptive and inferential statistical methods. The response rate was 65% (n = 111). The physicians reported that information systems support their decision making to some extent, but they do not improve access to information nor are they tailored for physicians. The respondents also reported that they need to use several information systems to support decision making and that they would prefer one information system to access important information. Improved information access would better support physicians' decision making and has the potential to improve the quality of decisions and speed up the decision making process. Peer reviewed

    Learning Clinical Data Representations for Machine Learning
