18 research outputs found

    Logic-based assessment of the compatibility of UMLS ontology sources

    Background: The UMLS Metathesaurus (UMLS-Meta) is currently the most comprehensive effort for integrating independently-developed medical thesauri and ontologies. UMLS-Meta is being used in many applications, including PubMed and ClinicalTrials.gov. The integration of new sources combines automatic techniques, expert assessment, and auditing protocols. The automatic techniques currently in use, however, are mostly based on lexical algorithms and often disregard the semantics of the sources being integrated. Results: In this paper, we argue that UMLS-Meta’s current design and auditing methodologies could be significantly enhanced by taking into account the logic-based semantics of the ontology sources. We provide empirical evidence suggesting that UMLS-Meta in its 2009AA version contains a significant number of errors; these errors become immediately apparent if the rich semantics of the ontology sources is taken into account, manifesting themselves as unintended logical consequences that follow from the ontology sources together with the information in UMLS-Meta. We then propose general principles and specific logic-based techniques to effectively detect and repair such errors. Conclusions: Our results suggest that the methodologies employed in the design of UMLS-Meta are not only very costly in terms of human effort, but also error-prone. The techniques presented here can be useful for both reducing human effort in the design and maintenance of UMLS-Meta and improving the quality of its contents

    Doctor of Philosophy

    Dissertation. Biomedical data are a rich source of information and knowledge. Not only are they useful for direct patient care, but they may also offer answers to important population-based questions. Creating an environment where advanced analytics can be performed against biomedical data is nontrivial, however. Biomedical data are currently scattered across multiple systems with heterogeneous data, and integrating these data is a bigger task than humans can realistically do by hand; therefore, automatic biomedical data integration is highly desirable but has never been fully achieved. This dissertation introduces new algorithms that were devised to support automatic and semiautomatic integration of heterogeneous biomedical data. The new algorithms incorporate both data mining and biomedical informatics techniques to create "concept bags" that are used to compute similarity between data elements in the same way that "word bags" are compared in data mining. Concept bags are composed of controlled medical vocabulary concept codes that are extracted from text using named-entity recognition software. To test the new algorithm, three biomedical text similarity use cases were examined: automatically aligning data elements between heterogeneous data sets, determining degrees of similarity between medical terms using a published benchmark, and determining similarity between ICU discharge summaries. The method is highly configurable, and five different versions were tested. The concept bag method performed particularly well at aligning data elements and outperformed the compared algorithms by more than 5%. Another configuration that included hierarchical semantics performed particularly well at matching medical terms, meeting or exceeding 30 of 31 other published results using the same benchmark. Results for the third scenario of computing ICU discharge summary similarity were less successful. Correlations between multiple methods were low, including between terminologists. 
The concept bag algorithms performed consistently and comparatively well and appear to be viable options for multiple scenarios. New applications of the method and ideas for improving the algorithm are being discussed for future work, including several performance enhancements, configuration-based enhancements, and concept vector weighting using the TF-IDF formulas
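
The idea of comparing concept bags the way word bags are compared can be illustrated with a minimal sketch. It assumes Jaccard overlap as the similarity measure; the abstract does not say which measure the dissertation actually uses. The concept codes are hypothetical placeholders, not real vocabulary codes, and the named-entity recognition step that would produce them is omitted.

```python
def jaccard(bag_a, bag_b):
    """Jaccard similarity between two concept bags, treated as sets of codes."""
    a, b = set(bag_a), set(bag_b)
    if not a and not b:
        # Two empty bags share no evidence; treat them as dissimilar (assumption).
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical concept codes for two data elements from different data sets.
element_one = ["C001", "C002", "C003"]
element_two = ["C001", "C002", "C004"]

print(jaccard(element_one, element_two))  # 2 shared codes out of 4 distinct -> 0.5
```

A weighted variant, such as the TF-IDF weighting named above as future work, would replace the set overlap with a comparison of weighted concept vectors.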

    Approximate string matching methods for duplicate detection and clustering tasks

    Approximate string matching methods are utilized by a vast number of duplicate detection and clustering applications in various knowledge domains. The application area is expected to grow due to the recent significant increase in the amount of digital data and knowledge sources. Despite the large number of existing string similarity metrics, there is a need for more precise approximate string matching methods to improve the efficiency of computer-driven data processing, thus decreasing labor-intensive human involvement. This work introduces a family of novel string similarity methods, which outperform a number of effective well-known and widely used string similarity functions. The new algorithms are designed to overcome the most common problem of the existing methods which is the lack of context sensitivity. In this evaluation, the Longest Approximately Common Prefix (LACP) method achieved the highest values of average precision and maximum F1 on three out of four medical informatics datasets used. The LACP demonstrated the lowest execution time ensured by the linear computational complexity within the set of evaluated algorithms. An online interactive spell checker of biomedical terms was developed based on the LACP method. The main goal of the spell checker was to evaluate the LACP method’s ability to make it possible to estimate the similarity of resulting sets at a glance. The Shortest Path Edit Distance (SPED) outperformed all evaluated similarity functions and gained the highest possible values of the average precision and maximum F1 measures on the bioinformatics datasets. The SPED design was inspired by the preceding work on the Markov Random Field Edit Distance (MRFED). The SPED eradicates two shortcomings of the MRFED, which are prolonged execution time and moderate performance. 
Four modifications of the Histogram Difference (HD) method demonstrated the best performance on the majority of the life and social sciences data sources used in the experiments. The modifications of the HD algorithm were achieved using several re-scorers: HD with Normalized Smith-Waterman Re-scorer, HD with TFIDF and Jaccard re-scorers, HD with the Longest Common Prefix and TFIDF re-scorers, and HD with the Unweighted Longest Common Prefix Re-scorer. Another contribution of this dissertation includes the extensive analysis of the string similarity methods evaluation for duplicate detection and clustering tasks on the life and social sciences, bioinformatics, and medical informatics domains. The experimental results are illustrated with precision-recall charts and a number of tables presenting the average precision, maximum F1, and execution time
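
For readers unfamiliar with approximate string matching, the sketch below shows the classic Levenshtein edit distance turned into a normalized similarity score. It is a generic baseline chosen for illustration, not the LACP, SPED, or HD methods described in the abstract, whose definitions are not given here. The function names are illustrative.

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(s) < len(t):
        s, t = t, s  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def similarity(s, t):
    """Normalize edit distance to [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(s, t) / longest


print(levenshtein("kitten", "sitting"))  # 3
```

A duplicate detector could flag pairs whose similarity exceeds a chosen threshold; the threshold value would have to be tuned per dataset.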

    Knowledge-based Biomedical Data Science 2019

    Knowledge-based biomedical data science (KBDS) involves the design and implementation of computer systems that act as if they knew about biomedicine. Such systems depend on formally represented knowledge in computer systems, often in the form of knowledge graphs. Here we survey the progress in the last year in systems that use formally represented knowledge to address data science problems in both clinical and biological domains, as well as on approaches for creating knowledge graphs. Major themes include the relationships between knowledge graphs and machine learning, the use of natural language processing, and the expansion of knowledge-based approaches to novel domains, such as Chinese Traditional Medicine and biodiversity. Comment: Manuscript 43 pages with 3 tables; Supplemental material 43 pages with 3 tables

    A tool based on standardized terminologies for the semantic annotation of textual information (Una herramienta basada en terminologías estandarizadas para la anotación semántica de información textual)

    The aim of this thesis is the design and implementation of lexical, syntactic, and semantic techniques that make the most of the available knowledge resources to improve the extraction and analysis of the relevant information contained in scientific publications

    BIOMEDICAL ONTOLOGIES: EXAMINING ASPECTS OF INTEGRATION ACROSS BREAST CANCER KNOWLEDGE DOMAINS

    The key ideas developed in this thesis lie at the intersection of epistemology, philosophy of molecular biology, medicine, and computer science. I examine how the epistemic and pragmatic needs of agents distributed across particular scientific disciplines influence the domain-specific reasoning, classification, and representation of breast cancer. The motivation to undertake an interdisciplinary approach, while addressing the problems of knowledge integration, originates in the peculiarity of the integrative endeavour of sciences that is fostered by information technologies and ontology engineering methods. I analyse what knowledge integration in this new field means and how it is possible to integrate diverse knowledge domains, such as clinical and molecular. I examine the extent and character of the integration achieved through the application of biomedical ontologies. While particular disciplines target certain aspects of breast cancer-related phenomena, biomedical ontologies target biomedical knowledge about phenomena that is often captured within diverse classificatory systems and domain-specific representations. In order to integrate dispersed pieces of knowledge, which is distributed across assorted research domains and knowledgebases, ontology engineers need to deal with the heterogeneity of terminological, conceptual, and practical aims that are not always shared among the domains. Accordingly, I analyse the specificities, similarities, and diversities across the clinical and biomedical domain conceptualisations and classifications of breast cancer. Instead of favouring a unifying approach to knowledge integration, my analysis shows that heterogeneous classifications and representations originate from different epistemic and pragmatic needs, each of which brings a fruitful insight into the problem. 
Thus, while embracing a pluralistic view on the ontologies that are capturing various aspects of knowledge, I argue that the resulting integration should be understood in terms of a coordinated social effort to bring knowledge together as needed and when needed, rather than in terms of a unity that represents domain-specific knowledge in a uniform manner. Furthermore, I characterise biomedical ontologies and knowledgebases as a novel socio-technological medium that allows representational interoperability across the domains. As an example, which also marks my own contribution to the collaborative efforts, I present an ontology for HER2+ breast cancer phenotypes that integrates clinical and molecular knowledge in an explicit way. Through this and a number of other examples, I specify how biomedical ontologies support a mutual enrichment of knowledge across the domains, thereby enabling the application of molecular knowledge into the clinics