60 research outputs found

    STRUCTURAL AND LEXICAL METHODS FOR AUDITING BIOMEDICAL TERMINOLOGIES

    Biomedical terminologies serve as knowledge sources for a wide variety of biomedical applications, including information extraction and retrieval, data integration and management, and decision support. Quality issues in biomedical terminologies, if not addressed, can affect all downstream applications that use them as knowledge sources. Terminology Quality Assurance (TQA) has therefore become an integral part of the terminology management lifecycle. However, identifying potential quality issues is challenging due to the ever-growing size and complexity of biomedical terminologies. Manual auditing is time-consuming and labor-intensive, so automated TQA methods are highly desirable. In this dissertation, systematic and scalable methods are proposed to audit biomedical terminologies using both their structural and lexical information. Two inference-based methods, two non-lattice-based methods, and a deep learning-based method are developed to identify potentially missing hierarchical (or is-a) relations, erroneous is-a relations, and missing concepts in biomedical terminologies including the Gene Ontology (GO), the National Cancer Institute thesaurus (NCIt), and SNOMED CT. In the first inference-based method, GO concept names are represented using a set-of-words model and a sequence-of-words model. Inconsistencies derived between hierarchically linked and unlinked concept pairs are leveraged to detect potentially missing or erroneous is-a relations. The set-of-words model detects a total of 5,359 potential inconsistencies in the 03/28/2017 release of GO and the sequence-of-words model detects 4,959. Domain experts' evaluation shows that the set-of-words model achieves a precision of 53.78% (128 out of 238) and the sequence-of-words model a precision of 57.55% (122 out of 212) in identifying inconsistencies.
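As an illustration, the two name models can be sketched as follows; the concept names are hypothetical GO-style examples, not drawn from the audited release. When such containment holds between a pair of concepts that are not hierarchically linked, the pair is flagged as a potential inconsistency:

```python
def is_subsequence(short, long_):
    """Sequence-of-words model: do the words of `short` occur,
    in order, within the word sequence `long_`?"""
    it = iter(long_)
    return all(word in it for word in short)

name_a = "cell migration".split()
name_b = "glial cell migration".split()

# Set-of-words model: unordered containment of one name's words.
print(set(name_a) < set(name_b))       # → True
# Sequence-of-words model: ordered containment.
print(is_subsequence(name_a, name_b))  # → True
```

If GO recorded no is-a path between these two concepts, the containment under either model would suggest a potentially missing relation.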
In the second inference-based method, a Subsumption-based Sub-term Inference Framework (SSIF) is developed by introducing a novel term-algebra on top of a sequence-based representation of GO concepts. The sequence-based representation utilizes the part of speech of concept names, sub-concepts (concept names appearing inside another concept name), and antonyms appearing in concept names. Three conditional rules (monotonicity, intersection, and sub-concept rules) are developed for backward subsumption inference. Applying SSIF to the 10/03/2018 release of GO suggests 1,938 potentially missing is-a relations. Domain experts' evaluation of 210 randomly selected potentially missing is-a relations shows that SSIF achieves a precision of 60.61%, 60.49%, and 46.03% for the monotonicity, intersection, and sub-concept rules, respectively. In the first non-lattice-based method, lexical patterns of concepts in Non-Lattice Subgraphs (NLSs: graph fragments with a higher tendency to contain quality issues) are mined to detect potentially missing is-a relations and missing concepts in NCIt. Six lexical patterns are leveraged: containment, union, intersection, union-intersection, inference-contradiction, and inference-union. Each pattern indicates a specific potential type of error and suggests a potential type of remediation. This method identifies 809 NLSs exhibiting these patterns in the 16.12d version of NCIt, achieving a precision of 66% (33 out of 50). In the second non-lattice-based method, enriched lexical attributes from concept ancestors are leveraged to identify potentially missing is-a relations in NLSs. The lexical attributes of a concept are inherited in two ways: from ancestors within the NLS, and from all the ancestors. For a pair of concepts without a hierarchical relation, if the lexical attributes of one concept are a subset of those of the other, a potentially missing is-a relation between the two concepts is suggested.
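The subset test at the core of this second non-lattice-based method can be sketched as follows; the concept names are hypothetical, and the enrichment of attributes from ancestors is assumed to have already been applied (here a concept's attributes are simply the words of its own name):

```python
def attributes(name):
    """Stand-in for a concept's enriched lexical attributes;
    in this sketch, just the lowercase words of its name."""
    return set(name.lower().split())

def suggest_isa(unlinked_pairs):
    """If one concept's attributes are a proper subset of the
    other's, suggest that the concept with the larger attribute
    set is-a the concept with the smaller one."""
    out = []
    for a, b in unlinked_pairs:
        sa, sb = attributes(a), attributes(b)
        if sa < sb:
            out.append((b, a))  # b is-a a suggested
        elif sb < sa:
            out.append((a, b))  # a is-a b suggested
    return out

pairs = [("stage I melanoma", "melanoma"),
         ("lung disorder", "skin disorder")]
print(suggest_isa(pairs))  # → [('stage I melanoma', 'melanoma')]
```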
This method identifies a total of 1,022 potentially missing is-a relations in the 19.01d release of NCIt, with a precision of 84.44% (76 out of 90) for inheriting lexical attributes from ancestors within the NLS and 89.02% (73 out of 82) for inheriting from all the ancestors. For the non-lattice-based methods, similar NLSs may contain similar quality issues, and thus exhaustive examination of NLSs would involve redundant work. A hybrid method is introduced to identify similar NLSs and avoid redundant analyses. Given an input NLS, a graph isomorphism algorithm is used to obtain its structurally identical NLSs. A similarity score between the input NLS and each of its structurally identical NLSs is computed based on the semantic similarity between their corresponding concept names. To compute this similarity, the concept names are converted to vectors using the Doc2Vec document embedding model and the cosine similarity of the two vectors is computed. All the structurally identical NLSs with a similarity score above 0.85 are considered similar to the input NLS. Applying this method to 10 different structures of NLSs in the 02/12/2018 release of GO reveals that 38.43% of these NLSs have at least one similar NLS. Finally, a deep learning-based method is explored to facilitate the suggestion of missing is-a relations in NCIt and SNOMED CT. Concept pairs exhibiting a containment pattern are the focus here. The problem is framed as a binary classification task: given a pair of concepts, the deep learning model learns to predict whether the two concepts have an is-a relation. Positive training samples are existing is-a relations in the terminology that exhibit the containment pattern. Negative training samples are concept pairs that also exhibit the containment pattern but have no is-a relation. A graph neural network model is constructed for this task and trained on subgraphs enclosing the concept pairs in the samples.
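The similarity scoring of the hybrid method can be sketched as follows; the Doc2Vec embedding step is assumed to be available separately, so plain numeric vectors stand in for embedded concept names:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def similar_nlss(input_vec, candidates, threshold=0.85):
    """Indices of structurally identical NLSs whose concept-name
    vectors clear the 0.85 cosine-similarity cut-off used above."""
    return [i for i, v in enumerate(candidates)
            if cosine(input_vec, v) > threshold]

# Illustrative vectors standing in for Doc2Vec embeddings.
print(similar_nlss([1.0, 0.0], [[1.0, 0.1], [0.0, 1.0]]))  # → [0]
```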
To evaluate the model trained on each terminology, two evaluation sets are created using newer releases of each terminology as a partial reference standard. The model trained on NCIt achieves a precision of 0.5, a recall of 0.75, and an F1 score of 0.6. The model trained on SNOMED CT achieves a precision of 0.51, a recall of 0.64, and an F1 score of 0.56.
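A minimal sketch of the containment-based construction of training samples, using hypothetical concept names and is-a relations rather than actual NCIt or SNOMED CT content:

```python
def containment_pairs(concepts):
    """Pairs exhibiting the containment pattern: one concept's
    name appears as a sub-string of the other's."""
    return [(b, a) for a in concepts for b in concepts
            if a != b and a in b]

def label_samples(concepts, existing_isa):
    """Positive samples: containment pairs already linked by is-a.
    Negative samples: containment pairs with no is-a relation."""
    pos, neg = [], []
    for b, a in containment_pairs(concepts):
        (pos if (b, a) in existing_isa else neg).append((b, a))
    return pos, neg

# Hypothetical concepts and is-a relations for illustration.
concepts = ["melanoma", "skin melanoma", "ocular melanoma"]
isa = {("skin melanoma", "melanoma")}
pos, neg = label_samples(concepts, isa)
print(pos)  # → [('skin melanoma', 'melanoma')]
print(neg)  # → [('ocular melanoma', 'melanoma')]
```

The negative pairs are exactly the candidates the trained classifier re-scores: a high is-a probability on such a pair suggests a missing relation.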

    Scalable Approaches for Auditing the Completeness of Biomedical Ontologies

    An ontology provides a formalized representation of knowledge within a domain. In biomedicine, ontologies have been widely used in modern biomedical applications to enable semantic interoperability and facilitate data exchange. Given the important roles that biomedical ontologies play, quality issues such as incompleteness, if not addressed, can affect the quality of downstream ontology-driven applications. However, biomedical ontologies often have large sizes and complex structures, so it is infeasible to uncover potential quality issues through manual effort. In this dissertation, we introduce automated and scalable approaches for auditing the completeness of biomedical ontologies. We mainly focus on two incompleteness issues: missing hierarchical relations and missing concepts. To identify missing hierarchical relations, we develop three approaches: a lexical-based approach, a hybrid approach utilizing both lexical features and logical definitions, and an approach based on concept name transformation. To identify missing concepts, a lexical-based Formal Concept Analysis (FCA) method is proposed for concept enrichment. We also predict proper concept names for the missing concepts using deep learning techniques. Manual review by domain experts is performed to evaluate these approaches. In addition, we leverage extrinsic knowledge (i.e., external ontologies) to help validate the detected incompleteness issues. The auditing approaches have been applied to a variety of biomedical ontologies, including SNOMED CT, the National Cancer Institute (NCI) Thesaurus, and the Gene Ontology. In the first lexical-based approach to identifying missing hierarchical relations, each concept is modeled with an enriched set of lexical features, leveraging words and noun phrases in the name of the concept itself and the concept's ancestors.
Given a pair of concepts that are not linked by a hierarchical relation, if the enriched lexical attributes of one concept are a superset of the other's, a potentially missing hierarchical relation is suggested. Applying this approach to the September 2017 release of SNOMED CT (US edition) suggested 38,615 potentially missing hierarchical relations. A domain expert reviewed a random sample of 100 of them and confirmed 90 as valid (a precision of 90%). In the second work, a hybrid approach is proposed to detect missing hierarchical relations in non-lattice subgraphs. For each concept, its lexical features are harmonized with role definitions to provide a more comprehensive semantic model. A two-step subsumption testing is then performed to automatically suggest potentially missing hierarchical relations. This approach identified 55 potentially missing hierarchical relations in the 19.08d version of the NCI Thesaurus. Of these, 29 were confirmed as valid by curators from the NCI Enterprise Vocabulary Service (EVS) and have been incorporated into newer versions of the NCI Thesaurus; 7 further revealed incorrect existing hierarchical relations in the NCI Thesaurus. In the third work, we introduce a transformation-based method that leverages Unified Medical Language System (UMLS) knowledge to identify missing hierarchical relations in its source ontologies. Given a concept name, noun chunks within it are identified and replaced by their more general counterparts to generate new concept names that should be more general than the original. Applying this method to the UMLS (2019AB release) detected a total of 39,359 potentially missing hierarchical relations in 13 source ontologies. Domain experts evaluated a random sample of 200 potentially missing hierarchical relations identified in SNOMED CT (US edition) and 100 in the Gene Ontology.
173 out of 200 and 63 out of 100 potentially missing hierarchical relations were confirmed by domain experts, indicating that our method achieved a precision of 86.5% for SNOMED CT and 63% for the Gene Ontology. In the work on concept enrichment, we introduce a lexical method based on FCA to identify potentially missing concepts. Lexical features (i.e., words appearing in the concept names) are treated as FCA attributes when generating the formal context. Applying multistage intersection on FCA attributes yields newly formalized concepts along with bags of words that can be utilized to name them. This method was applied to the Disease or Disorder sub-hierarchy in the 19.08d version of the NCI Thesaurus and identified 8,983 potentially missing concepts. We performed a preliminary evaluation and validated that 592 of the 8,983 potentially missing concepts were included in external ontologies in the UMLS. After obtaining new concepts and their relevant bags of words, we further developed deep learning-based approaches to automatically predict concept names that comply with the naming convention of a specific ontology. We explored a simple neural network, Long Short-Term Memory (LSTM), and a Convolutional Neural Network (CNN) combined with LSTM. Our experiments showed that the LSTM-based approach achieved the best performance, with an F1 score of 63.41% for predicting names of newly added concepts in the March 2018 release of SNOMED CT (US edition) and an F1 score of 73.95% for naming missing concepts revealed by our previous work. In the last part of this dissertation, extrinsic knowledge is leveraged to collect supporting evidence for the detected incompleteness issues. We present a work in which cross-ontology evaluation based on extrinsic knowledge from the UMLS is utilized to help validate potentially missing hierarchical relations, aiming to relieve the heavy workload of manual review.
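One intersection stage of the FCA-based enrichment can be sketched as follows, with hypothetical concepts and word attributes standing in for the NCI Thesaurus formal context:

```python
from itertools import combinations

def candidate_concepts(context):
    """One intersection stage of FCA-style enrichment: a pairwise
    intersection of word-attribute sets that matches no existing
    concept becomes a candidate missing concept, identified by its
    bag of words."""
    existing = {frozenset(words) for words in context.values()}
    found = set()
    for a, b in combinations(context.values(), 2):
        common = frozenset(a) & frozenset(b)
        if common and common not in existing:
            found.add(common)
    return found

# Hypothetical concepts and their word attributes.
ctx = {"skin melanoma": {"skin", "melanoma"},
       "ocular melanoma": {"ocular", "melanoma"},
       "skin disorder": {"skin", "disorder"}}
print(candidate_concepts(ctx))  # candidates: {'melanoma'} and {'skin'}
```

In the full method, these intersection stages are applied repeatedly, and the resulting bags of words feed the name-prediction models described above.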

    COMPUTATIONAL TOOLS FOR THE DYNAMIC CATEGORIZATION AND AUGMENTED UTILIZATION OF THE GENE ONTOLOGY

    Ontologies provide an organization of language, in the form of a network or graph, which is amenable to computational analysis while remaining human-readable. Although they are used in a variety of disciplines, ontologies in the biomedical field, such as the Gene Ontology, are of interest for their role in organizing terminology used to describe, among other concepts, the functions, locations, and processes of genes and gene products. Due to the consistency and level of automation that ontologies provide for such annotations, methods for finding enriched biological terminology from a set of differentially identified genes in a tissue or cell sample have been developed to aid in the elucidation of disease pathology and unknown biochemical pathways. However, despite their immense utility, biomedical ontologies have significant limitations and caveats. One major issue is that gene annotation enrichment analyses often result in many redundant, individually enriched ontological terms that are highly specific and weakly justified by statistical significance. These large sets of weakly enriched terms are difficult to interpret without manually sorting them into appropriate functional or descriptive categories. Also, the relationships that organize the terminology within these ontologies do not describe semantic scoping or scaling among terms. The resulting ambiguity complicates the automation of categorizing terms to improve interpretability. We emphasize that existing methods risk producing incorrect mappings to categories as a result of these ambiguities, unless simplified and incomplete versions of these ontologies that omit problematic relations are used.
Such ambiguities can have a significant impact on term categorization: we calculated upper-bound estimates of potential false categorizations as high as 121,579 for the misinterpretation of a single scoping relation, has_part, which accounts for approximately 18% of the total possible mappings between terms in the Gene Ontology. Omitting problematic relationships, however, results in a significant loss of retrievable information; in the Gene Ontology, omitting this single relation accounts for a 6% reduction, and the percentage increases drastically when all relations in an ontology are considered. To address these issues, we have developed methods that categorize individual ontology terms into broad, biologically related concepts to improve the interpretability and statistical significance of gene-annotation enrichment studies, while addressing the lack of semantic scoping and scaling descriptions among ontological relationships so that annotation enrichment analyses can be performed across a more complete representation of the ontological graph. We show that, compared to similar term categorization methods, our method produces categorizations that match hand-curated ones with similar or better accuracy, while not requiring the user to compile lists of individual ontology term IDs. Furthermore, our handling of problematic relations produces a more complete representation of ontological information from a scoping perspective, and we demonstrate instances where medically relevant terms, and by extension putative gene targets, are identified in our annotation enrichment results that would otherwise be missed when using traditional methods. Additionally, we observed a marginal yet consistent improvement in statistical power in enrichment results when our methods were used, compared to traditional enrichment analyses that utilize ontological ancestors.
Finally, using scalable and reproducible data workflow pipelines, we have applied our methods to several genomic, transcriptomic, and proteomic collaborative projects.

    GOcats: A Tool for Categorizing Gene Ontology into Subgraphs of User-Defined Concepts

    Gene Ontology is used extensively in scientific knowledgebases and repositories to organize a wealth of biological information. However, interpreting annotations derived from differential gene lists is often difficult without manually sorting them into higher-order categories. To address these issues, we present GOcats, a novel tool that organizes the Gene Ontology (GO) into subgraphs representing user-defined concepts, while ensuring that all appropriate relations are congruent with respect to scoping semantics. We tested GOcats' performance using subcellular location categories to mine annotations from GO-utilizing knowledgebases and evaluated their accuracy against immunohistochemistry datasets in the Human Protein Atlas (HPA). Compared to term categorizations generated from UniProt's controlled vocabulary and from GO slims via OWLTools' Map2Slim, GOcats outperformed these methods in its ability to mimic human-categorized GO term sets. Unlike the other methods, GOcats relies only on an input of basic keywords from the user (e.g. a biologist), not a manually compiled or static set of top-level GO terms. Additionally, by identifying and properly defining relations with respect to semantic scope, GOcats can utilize the traditionally problematic relation, has_part, without encountering erroneous term mapping. We applied GOcats in the comparison of HPA-sourced knowledgebase annotations to experimentally derived annotations provided by HPA directly. During the comparison, GOcats improved correspondence between the annotation sources by adjusting semantic granularity. GOcats enables the creation of custom, GO slim-like filters to map fine-grained gene annotations from gene annotation files to general subcellular compartments without needing to hand-select a set of GO terms for categorization. Moreover, GOcats can customize the level of semantic specificity for annotation categories.
Furthermore, GOcats enables safer and more comprehensive utilization of the semantic scoping in go-core, allowing for a more complete use of the information available in GO. Together, these improvements can benefit a variety of GO knowledgebase data mining use cases as well as knowledgebase curation and quality control.
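A minimal sketch of the scoping distinction involved in handling has_part, using a toy graph; the edge list and single-hop traversal are illustrative assumptions, not GOcats' actual implementation:

```python
# Toy GO-like edge list; each entry is (source, relation, target).
# has_part runs in the opposite scoping direction from is_a/part_of,
# which is the ambiguity that must be resolved before categorization.
EDGES = [
    ("nuclear membrane", "part_of", "nucleus"),
    ("nucleus", "part_of", "cell"),
    ("nucleus", "has_part", "chromosome"),
]

def members(category, edges):
    """Single-hop sketch of terms falling within `category`'s scope:
    is_a/part_of edges point *into* the category, while has_part edges
    point *out* of it, so their targets are the in-scope terms."""
    out = set()
    for src, rel, dst in edges:
        if rel in ("is_a", "part_of") and dst == category:
            out.add(src)
        elif rel == "has_part" and src == category:
            out.add(dst)
    return out

print(sorted(members("nucleus", EDGES)))  # → ['chromosome', 'nuclear membrane']
```

Naively following has_part in the same direction as part_of would instead place the whole inside the part's category, which is the class of false mapping quantified in the preceding abstract.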

    Combining ontologies and rules with clinical archetypes

    Like other fields that rely heavily on the capabilities offered by information and communication technologies (IT), biomedicine and healthcare increasingly require the adoption of widely accepted standards and mechanisms for the exchange of data, information, and knowledge. This need for compatibility and interoperability goes beyond syntactic and structural matters, since semantic interoperability is also required. Interoperability at the semantic level is essential for the computerized support of alerts, workflows, and evidence-based medicine in the presence of heterogeneous Electronic Health Record (EHR) systems. The clinical archetype model, backed by the CEN/ISO EN13606 standard and the openEHR Foundation, offers a mechanism for expressing clinical data structures in a shared, interoperable way. The model has gained acceptance in recent years for its ability to define clinical concepts based on a common Reference Model. This two-layer separation preserves the heterogeneity of the low-level storage implementations found in different EHR systems. However, archetype languages support neither the representation of clinical rules nor mappings to formal ontologies, both fundamental elements for achieving full semantic interoperability, since they enable reasoning and inference over existing clinical knowledge. In parallel, it is recognized that the World Wide Web presents requirements analogous to those described above, which has fostered the development of the Semantic Web.
The progress achieved in that area with respect to knowledge representation and reasoning is combined in this thesis with EHR models, with the aim of improving the clinical archetype approach and offering functionality corresponding to a higher level of semantic interoperability. Specifically, the research described here presents and evaluates an approach for automatically translating definitions expressed in the openEHR Archetype Definition Language (ADL) into a formal representation based on ontology languages. The method is implemented in the ArchOnt platform, which is also described. The integration of these formal representations with clinical rules is then studied, and an approach for reusing reasoning over concrete instances of clinical data is offered. Notably, sharing clinical knowledge expressed through rules is consistent with the open-exchange philosophy promoted by archetypes, while extending reuse to declarative knowledge propositions such as those used in clinical practice guidelines. The thesis thus describes a technique for mapping archetypes to ontologies and then attaching clinical rules to the resulting representation. The automatic translation also enables the formal connection of elements specified in archetypes with equivalent clinical concepts from other sources, such as clinical terminologies. These links encourage the reuse of clinical knowledge already represented, as well as reasoning and navigation across different clinical ontologies. Another significant contribution of the thesis is the application of this approach in two clinical research and development projects carried out with university hospitals in Madrid.
The explanation includes examples of the most representative applications of the approach, such as the development of alert systems aimed at improving patient safety. Moreover, the automatic translation of clinical archetypes into ontology languages constitutes a common foundation for implementing a wide range of semantic activities, reasoning, and validation, thereby avoiding the need to apply separate ad-hoc approaches directly to the archetypes to satisfy the conditions of each context.

    Term-driven E-Commerce

    This work addresses the textual dimension of e-commerce. Its basic hypothesis is that information and transactions in electronic commerce are bound to text. Wherever products and services are offered, sought, perceived, and evaluated, natural-language expressions come into play. Two things follow: on the one hand, it is important to capture the variance of textual descriptions in e-commerce; on the other, the extensive textual resources generated in e-commerce interactions can be drawn upon for a better understanding of natural language.
