60 research outputs found

    STRUCTURAL AND LEXICAL METHODS FOR AUDITING BIOMEDICAL TERMINOLOGIES

    Biomedical terminologies serve as knowledge sources for a wide variety of biomedical applications, including information extraction and retrieval, data integration and management, and decision support. Quality issues in biomedical terminologies, if not addressed, can affect all downstream applications that use them as knowledge sources. Terminology Quality Assurance (TQA) has therefore become an integral part of the terminology management lifecycle. However, identifying potential quality issues is challenging due to the ever-growing size and complexity of biomedical terminologies. Manual auditing is time-consuming and labor-intensive, so automated TQA methods are highly desirable. In this dissertation, systematic and scalable methods are proposed to audit biomedical terminologies using both their structural and lexical information. Two inference-based methods, two non-lattice-based methods, and a deep learning-based method are developed to identify potentially missing hierarchical (or is-a) relations, erroneous is-a relations, and missing concepts in biomedical terminologies including the Gene Ontology (GO), the National Cancer Institute thesaurus (NCIt), and SNOMED CT. In the first inference-based method, GO concept names are represented using a set-of-words model and a sequence-of-words model. Inconsistencies derived between hierarchically linked and unlinked concept pairs are leveraged to detect potentially missing or erroneous is-a relations. The set-of-words model detects a total of 5,359 potential inconsistencies in the 03/28/2017 release of GO and the sequence-of-words model detects 4,959. Domain experts' evaluation shows that the set-of-words model achieves a precision of 53.78% (128 out of 238) and the sequence-of-words model a precision of 57.55% (122 out of 212) in identifying inconsistencies.
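As an illustration, the two name models can be sketched as follows; the concept names are hypothetical GO-style examples, not drawn from the audited release. When such containment holds between a pair of concepts that are not hierarchically linked, the pair is flagged as a potential inconsistency:

```python
def is_subsequence(short, long_):
    """Sequence-of-words model: do the words of `short` occur,
    in order, within the word sequence `long_`?"""
    it = iter(long_)
    return all(word in it for word in short)

name_a = "cell migration".split()
name_b = "glial cell migration".split()

# Set-of-words model: unordered containment of one name's words.
print(set(name_a) < set(name_b))       # → True
# Sequence-of-words model: ordered containment.
print(is_subsequence(name_a, name_b))  # → True
```

If GO recorded no is-a path between these two concepts, the containment under either model would suggest a potentially missing relation.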
In the second inference-based method, a Subsumption-based Sub-term Inference Framework (SSIF) is developed by introducing a novel term-algebra on top of a sequence-based representation of GO concepts. The sequence-based representation utilizes the part of speech of concept names, sub-concepts (concept names appearing inside another concept name), and antonyms appearing in concept names. Three conditional rules (monotonicity, intersection, and sub-concept rules) are developed for backward subsumption inference. Applying SSIF to the 10/03/2018 release of GO suggests 1,938 potentially missing is-a relations. Domain experts' evaluation of 210 randomly selected potentially missing is-a relations shows that SSIF achieves a precision of 60.61%, 60.49%, and 46.03% for the monotonicity, intersection, and sub-concept rules, respectively. In the first non-lattice-based method, lexical patterns of concepts in Non-Lattice Subgraphs (NLSs: graph fragments with a higher tendency to contain quality issues) are mined to detect potentially missing is-a relations and missing concepts in NCIt. Six lexical patterns are leveraged: containment, union, intersection, union-intersection, inference-contradiction, and inference-union. Each pattern indicates a specific potential type of error and suggests a potential type of remediation. This method identifies 809 NLSs exhibiting these patterns in the 16.12d version of NCIt, achieving a precision of 66% (33 out of 50). In the second non-lattice-based method, enriched lexical attributes from concept ancestors are leveraged to identify potentially missing is-a relations in NLSs. The lexical attributes of a concept are inherited in two ways: from ancestors within the NLS, and from all the ancestors. For a pair of concepts without a hierarchical relation, if the lexical attributes of one concept are a subset of those of the other, a potentially missing is-a relation between the two concepts is suggested.
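The subset test at the core of this second non-lattice-based method can be sketched as follows; the concept names are hypothetical, and the enrichment of attributes from ancestors is assumed to have already been applied (here a concept's attributes are simply the words of its own name):

```python
def attributes(name):
    """Stand-in for a concept's enriched lexical attributes;
    in this sketch, just the lowercase words of its name."""
    return set(name.lower().split())

def suggest_isa(unlinked_pairs):
    """If one concept's attributes are a proper subset of the
    other's, suggest that the concept with the larger attribute
    set is-a the concept with the smaller one."""
    out = []
    for a, b in unlinked_pairs:
        sa, sb = attributes(a), attributes(b)
        if sa < sb:
            out.append((b, a))  # b is-a a suggested
        elif sb < sa:
            out.append((a, b))  # a is-a b suggested
    return out

pairs = [("stage I melanoma", "melanoma"),
         ("lung disorder", "skin disorder")]
print(suggest_isa(pairs))  # → [('stage I melanoma', 'melanoma')]
```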
This method identifies a total of 1,022 potentially missing is-a relations in the 19.01d release of NCIt, with a precision of 84.44% (76 out of 90) for inheriting lexical attributes from ancestors within the NLS and 89.02% (73 out of 82) for inheriting from all the ancestors. For the non-lattice-based methods, similar NLSs may contain similar quality issues, and thus exhaustive examination of NLSs would involve redundant work. A hybrid method is introduced to identify similar NLSs and avoid redundant analyses. Given an input NLS, a graph isomorphism algorithm is used to obtain its structurally identical NLSs. A similarity score between the input NLS and each of its structurally identical NLSs is computed based on the semantic similarity between their corresponding concept names. To compute this similarity, the concept names are converted to vectors using the Doc2Vec document embedding model and the cosine similarity of the two vectors is computed. All the structurally identical NLSs with a similarity score above 0.85 are considered similar to the input NLS. Applying this method to 10 different structures of NLSs in the 02/12/2018 release of GO reveals that 38.43% of these NLSs have at least one similar NLS. Finally, a deep learning-based method is explored to facilitate the suggestion of missing is-a relations in NCIt and SNOMED CT. Concept pairs exhibiting a containment pattern are the focus here. The problem is framed as a binary classification task: given a pair of concepts, the deep learning model learns to predict whether the two concepts have an is-a relation. Positive training samples are existing is-a relations in the terminology that exhibit the containment pattern. Negative training samples are concept pairs that also exhibit the containment pattern but have no is-a relation. A graph neural network model is constructed for this task and trained on subgraphs enclosing the concept pairs in the samples.
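The similarity scoring of the hybrid method can be sketched as follows; the Doc2Vec embedding step is assumed to be available separately, so plain numeric vectors stand in for embedded concept names:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def similar_nlss(input_vec, candidates, threshold=0.85):
    """Indices of structurally identical NLSs whose concept-name
    vectors clear the 0.85 cosine-similarity cut-off used above."""
    return [i for i, v in enumerate(candidates)
            if cosine(input_vec, v) > threshold]

# Illustrative vectors standing in for Doc2Vec embeddings.
print(similar_nlss([1.0, 0.0], [[1.0, 0.1], [0.0, 1.0]]))  # → [0]
```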
To evaluate the model trained on each terminology, two evaluation sets are created using newer releases of each terminology as a partial reference standard. The model trained on NCIt achieves a precision of 0.5, a recall of 0.75, and an F1 score of 0.6. The model trained on SNOMED CT achieves a precision of 0.51, a recall of 0.64, and an F1 score of 0.56.
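A minimal sketch of the containment-based construction of training samples, using hypothetical concept names and is-a relations rather than actual NCIt or SNOMED CT content:

```python
def containment_pairs(concepts):
    """Pairs exhibiting the containment pattern: one concept's
    name appears as a sub-string of the other's."""
    return [(b, a) for a in concepts for b in concepts
            if a != b and a in b]

def label_samples(concepts, existing_isa):
    """Positive samples: containment pairs already linked by is-a.
    Negative samples: containment pairs with no is-a relation."""
    pos, neg = [], []
    for b, a in containment_pairs(concepts):
        (pos if (b, a) in existing_isa else neg).append((b, a))
    return pos, neg

# Hypothetical concepts and is-a relations for illustration.
concepts = ["melanoma", "skin melanoma", "ocular melanoma"]
isa = {("skin melanoma", "melanoma")}
pos, neg = label_samples(concepts, isa)
print(pos)  # → [('skin melanoma', 'melanoma')]
print(neg)  # → [('ocular melanoma', 'melanoma')]
```

The negative pairs are exactly the candidates the trained classifier re-scores: a high is-a probability on such a pair suggests a missing relation.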

    Scalable Approaches for Auditing the Completeness of Biomedical Ontologies

    An ontology provides a formalized representation of knowledge within a domain. In biomedicine, ontologies have been widely used in modern biomedical applications to enable semantic interoperability and facilitate data exchange. Given the important roles that biomedical ontologies play, quality issues such as incompleteness, if not addressed, can affect the quality of downstream ontology-driven applications. However, biomedical ontologies often have large sizes and complex structures, so it is infeasible to uncover potential quality issues through manual effort. In this dissertation, we introduce automated and scalable approaches for auditing the completeness of biomedical ontologies. We mainly focus on two incompleteness issues: missing hierarchical relations and missing concepts. To identify missing hierarchical relations, we develop three approaches: a lexical-based approach, a hybrid approach utilizing both lexical features and logical definitions, and an approach based on concept name transformation. To identify missing concepts, a lexical-based Formal Concept Analysis (FCA) method is proposed for concept enrichment. We also predict proper concept names for the missing concepts using deep learning techniques. Manual review by domain experts is performed to evaluate these approaches. In addition, we leverage extrinsic knowledge (i.e., external ontologies) to help validate the detected incompleteness issues. The auditing approaches have been applied to a variety of biomedical ontologies, including SNOMED CT, the National Cancer Institute (NCI) Thesaurus, and the Gene Ontology. In the first lexical-based approach to identifying missing hierarchical relations, each concept is modeled with an enriched set of lexical features, leveraging words and noun phrases in the name of the concept itself and the concept's ancestors.
Given a pair of concepts that are not linked by a hierarchical relation, if the enriched lexical attributes of one concept are a superset of the other's, a potentially missing hierarchical relation is suggested. Applying this approach to the September 2017 release of SNOMED CT (US edition) suggested 38,615 potentially missing hierarchical relations. A domain expert reviewed a random sample of 100 of them and confirmed 90 as valid (a precision of 90%). In the second work, a hybrid approach is proposed to detect missing hierarchical relations in non-lattice subgraphs. For each concept, its lexical features are harmonized with role definitions to provide a more comprehensive semantic model. A two-step subsumption testing is then performed to automatically suggest potentially missing hierarchical relations. This approach identified 55 potentially missing hierarchical relations in the 19.08d version of the NCI Thesaurus. Of these, 29 were confirmed as valid by curators from the NCI Enterprise Vocabulary Service (EVS) and have been incorporated into newer versions of the NCI Thesaurus; 7 further revealed incorrect existing hierarchical relations in the NCI Thesaurus. In the third work, we introduce a transformation-based method that leverages Unified Medical Language System (UMLS) knowledge to identify missing hierarchical relations in its source ontologies. Given a concept name, noun chunks within it are identified and replaced by their more general counterparts to generate new concept names that should be more general than the original. Applying this method to the UMLS (2019AB release) detected a total of 39,359 potentially missing hierarchical relations in 13 source ontologies. Domain experts evaluated a random sample of 200 potentially missing hierarchical relations identified in SNOMED CT (US edition) and 100 in the Gene Ontology.
173 out of 200 and 63 out of 100 potentially missing hierarchical relations were confirmed by domain experts, indicating that our method achieved a precision of 86.5% for SNOMED CT and 63% for the Gene Ontology. In the work on concept enrichment, we introduce a lexical method based on FCA to identify potentially missing concepts. Lexical features (i.e., words appearing in the concept names) are treated as FCA attributes when generating the formal context. Applying multistage intersection on FCA attributes yields newly formalized concepts along with bags of words that can be utilized to name them. This method was applied to the Disease or Disorder sub-hierarchy in the 19.08d version of the NCI Thesaurus and identified 8,983 potentially missing concepts. We performed a preliminary evaluation and validated that 592 of the 8,983 potentially missing concepts were included in external ontologies in the UMLS. After obtaining new concepts and their relevant bags of words, we further developed deep learning-based approaches to automatically predict concept names that comply with the naming convention of a specific ontology. We explored a simple neural network, Long Short-Term Memory (LSTM), and a Convolutional Neural Network (CNN) combined with LSTM. Our experiments showed that the LSTM-based approach achieved the best performance, with an F1 score of 63.41% for predicting names of newly added concepts in the March 2018 release of SNOMED CT (US edition) and an F1 score of 73.95% for naming missing concepts revealed by our previous work. In the last part of this dissertation, extrinsic knowledge is leveraged to collect supporting evidence for the detected incompleteness issues. We present a work in which cross-ontology evaluation based on extrinsic knowledge from the UMLS is utilized to help validate potentially missing hierarchical relations, aiming to relieve the heavy workload of manual review.
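One intersection stage of the FCA-based enrichment can be sketched as follows, with hypothetical concepts and word attributes standing in for the NCI Thesaurus formal context:

```python
from itertools import combinations

def candidate_concepts(context):
    """One intersection stage of FCA-style enrichment: a pairwise
    intersection of word-attribute sets that matches no existing
    concept becomes a candidate missing concept, identified by its
    bag of words."""
    existing = {frozenset(words) for words in context.values()}
    found = set()
    for a, b in combinations(context.values(), 2):
        common = frozenset(a) & frozenset(b)
        if common and common not in existing:
            found.add(common)
    return found

# Hypothetical concepts and their word attributes.
ctx = {"skin melanoma": {"skin", "melanoma"},
       "ocular melanoma": {"ocular", "melanoma"},
       "skin disorder": {"skin", "disorder"}}
print(candidate_concepts(ctx))  # candidates: {'melanoma'} and {'skin'}
```

In the full method, these intersection stages are applied repeatedly, and the resulting bags of words feed the name-prediction models described above.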

    COMPUTATIONAL TOOLS FOR THE DYNAMIC CATEGORIZATION AND AUGMENTED UTILIZATION OF THE GENE ONTOLOGY

    Ontologies provide an organization of language, in the form of a network or graph, which is amenable to computational analysis while remaining human-readable. Although they are used in a variety of disciplines, ontologies in the biomedical field, such as the Gene Ontology, are of interest for their role in organizing terminology used to describe, among other concepts, the functions, locations, and processes of genes and gene products. Due to the consistency and level of automation that ontologies provide for such annotations, methods for finding enriched biological terminology from a set of differentially identified genes in a tissue or cell sample have been developed to aid in the elucidation of disease pathology and unknown biochemical pathways. However, despite their immense utility, biomedical ontologies have significant limitations and caveats. One major issue is that gene annotation enrichment analyses often result in many redundant, individually enriched ontological terms that are highly specific and weakly justified by statistical significance. These large sets of weakly enriched terms are difficult to interpret without manually sorting them into appropriate functional or descriptive categories. Also, the relationships that organize the terminology within these ontologies do not describe semantic scoping or scaling among terms. The resulting ambiguity complicates the automation of categorizing terms to improve interpretability. We emphasize that existing methods risk producing incorrect mappings to categories as a result of these ambiguities, unless simplified and incomplete versions of these ontologies that omit problematic relations are used.
Such ambiguities can have a significant impact on term categorization: we calculated upper-bound estimates of potential false categorizations as high as 121,579 for the misinterpretation of a single scoping relation, has_part, which accounts for approximately 18% of the total possible mappings between terms in the Gene Ontology. Omitting problematic relationships, however, results in a significant loss of retrievable information; in the Gene Ontology, omitting this single relation accounts for a 6% reduction, and the percentage increases drastically when all relations in an ontology are considered. To address these issues, we have developed methods that categorize individual ontology terms into broad, biologically related concepts to improve the interpretability and statistical significance of gene-annotation enrichment studies, while addressing the lack of semantic scoping and scaling descriptions among ontological relationships so that annotation enrichment analyses can be performed across a more complete representation of the ontological graph. We show that, compared to similar term categorization methods, our method produces categorizations that match hand-curated ones with similar or better accuracy, while not requiring the user to compile lists of individual ontology term IDs. Furthermore, our handling of problematic relations produces a more complete representation of ontological information from a scoping perspective, and we demonstrate instances where medically relevant terms, and by extension putative gene targets, are identified in our annotation enrichment results that would otherwise be missed when using traditional methods. Additionally, we observed a marginal yet consistent improvement in statistical power in enrichment results when our methods were used, compared to traditional enrichment analyses that utilize ontological ancestors.
Finally, using scalable and reproducible data workflow pipelines, we have applied our methods to several genomic, transcriptomic, and proteomic collaborative projects.

    GOcats: A Tool for Categorizing Gene Ontology into Subgraphs of User-Defined Concepts

    Gene Ontology is used extensively in scientific knowledgebases and repositories to organize a wealth of biological information. However, interpreting annotations derived from differential gene lists is often difficult without manually sorting them into higher-order categories. To address these issues, we present GOcats, a novel tool that organizes the Gene Ontology (GO) into subgraphs representing user-defined concepts, while ensuring that all appropriate relations are congruent with respect to scoping semantics. We tested GOcats' performance using subcellular location categories to mine annotations from GO-utilizing knowledgebases and evaluated their accuracy against immunohistochemistry datasets in the Human Protein Atlas (HPA). Compared to term categorizations generated from UniProt's controlled vocabulary and from GO slims via OWLTools' Map2Slim, GOcats outperformed these methods in its ability to mimic human-categorized GO term sets. Unlike the other methods, GOcats relies only on an input of basic keywords from the user (e.g. a biologist), not a manually compiled or static set of top-level GO terms. Additionally, by identifying and properly defining relations with respect to semantic scope, GOcats can utilize the traditionally problematic relation, has_part, without encountering erroneous term mapping. We applied GOcats in the comparison of HPA-sourced knowledgebase annotations to experimentally derived annotations provided by HPA directly. During the comparison, GOcats improved correspondence between the annotation sources by adjusting semantic granularity. GOcats enables the creation of custom, GO slim-like filters to map fine-grained gene annotations from gene annotation files to general subcellular compartments without needing to hand-select a set of GO terms for categorization. Moreover, GOcats can customize the level of semantic specificity for annotation categories.
Furthermore, GOcats enables safer and more comprehensive utilization of the semantic scoping in go-core, allowing for a more complete use of the information available in GO. Together, these improvements can benefit a variety of GO knowledgebase data mining use cases as well as knowledgebase curation and quality control.
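A minimal sketch of the scoping distinction involved in handling has_part, using a toy graph; the edge list and single-hop traversal are illustrative assumptions, not GOcats' actual implementation:

```python
# Toy GO-like edge list; each entry is (source, relation, target).
# has_part runs in the opposite scoping direction from is_a/part_of,
# which is the ambiguity that must be resolved before categorization.
EDGES = [
    ("nuclear membrane", "part_of", "nucleus"),
    ("nucleus", "part_of", "cell"),
    ("nucleus", "has_part", "chromosome"),
]

def members(category, edges):
    """Single-hop sketch of terms falling within `category`'s scope:
    is_a/part_of edges point *into* the category, while has_part edges
    point *out* of it, so their targets are the in-scope terms."""
    out = set()
    for src, rel, dst in edges:
        if rel in ("is_a", "part_of") and dst == category:
            out.add(src)
        elif rel == "has_part" and src == category:
            out.add(dst)
    return out

print(sorted(members("nucleus", EDGES)))  # → ['chromosome', 'nuclear membrane']
```

Naively following has_part in the same direction as part_of would instead place the whole inside the part's category, which is the class of false mapping quantified in the preceding abstract.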

    Combining ontologies and rules with clinical archetypes

    Like other fields that rely heavily on the capabilities offered by information and communication technologies (IT), biomedicine and healthcare increasingly require the adoption of widely accepted standards and mechanisms for the exchange of data, information, and knowledge. This need for compatibility and interoperability goes beyond syntactic and structural matters, since semantic interoperability is also required. Interoperability at the semantic level is essential for the computerized support of alerts, workflows, and evidence-based medicine in the presence of heterogeneous Electronic Health Record (EHR) systems. The clinical archetype model, backed by the CEN/ISO EN13606 standard and the openEHR Foundation, offers a mechanism for expressing clinical data structures in a shared, interoperable way. The model has gained acceptance in recent years for its ability to define clinical concepts based on a common Reference Model. This two-layer separation preserves the heterogeneity of the low-level storage implementations found in different EHR systems. However, archetype languages support neither the representation of clinical rules nor mappings to formal ontologies, both fundamental elements for achieving full semantic interoperability, since they enable reasoning and inference over existing clinical knowledge. In parallel, it is recognized that the World Wide Web presents requirements analogous to those described above, which has fostered the development of the Semantic Web.
The progress achieved in that area with respect to knowledge representation and reasoning is combined in this thesis with EHR models, with the aim of improving the clinical archetype approach and offering functionality corresponding to a higher level of semantic interoperability. Specifically, the research described here presents and evaluates an approach for automatically translating definitions expressed in the openEHR Archetype Definition Language (ADL) into a formal representation based on ontology languages. The method is implemented in the ArchOnt platform, which is also described. The integration of these formal representations with clinical rules is then studied, and an approach for reusing reasoning over concrete instances of clinical data is offered. Notably, sharing clinical knowledge expressed through rules is consistent with the open-exchange philosophy promoted by archetypes, while extending reuse to declarative knowledge propositions such as those used in clinical practice guidelines. The thesis thus describes a technique for mapping archetypes to ontologies and then attaching clinical rules to the resulting representation. The automatic translation also enables the formal connection of elements specified in archetypes with equivalent clinical concepts from other sources, such as clinical terminologies. These links encourage the reuse of clinical knowledge already represented, as well as reasoning and navigation across different clinical ontologies. Another significant contribution of the thesis is the application of this approach in two clinical research and development projects carried out with university hospitals in Madrid.
The explanation includes examples of the most representative applications of the approach, such as the development of alert systems aimed at improving patient safety. Moreover, the automatic translation of clinical archetypes into ontology languages constitutes a common foundation for implementing a wide range of semantic activities, reasoning, and validation, thereby avoiding the need to apply separate ad-hoc approaches directly to the archetypes to satisfy the conditions of each context.

    Term-driven E-Commerce

    This work addresses the textual dimension of e-commerce. Its basic hypothesis is that information and transactions in electronic commerce are bound to text. Wherever products and services are offered, sought, perceived, and evaluated, natural-language expressions come into play. Two things follow: on the one hand, it is important to capture the variance of textual descriptions in e-commerce; on the other, the extensive textual resources generated in e-commerce interactions can be drawn upon for a better understanding of natural language.
