
    Structural and lexical methods for auditing biomedical terminologies

    Biomedical terminologies serve as knowledge sources for a wide variety of biomedical applications, including information extraction and retrieval, data integration and management, and decision support. Quality issues in biomedical terminologies, if not addressed, can affect all downstream applications that use them as knowledge sources. Terminology Quality Assurance (TQA) has therefore become an integral part of the terminology management lifecycle. However, identifying potential quality issues is challenging due to the ever-growing size and complexity of biomedical terminologies; manual auditing is time-consuming and labor-intensive, so automated TQA methods are highly desirable. This dissertation proposes systematic and scalable methods to audit biomedical terminologies using both their structural and lexical information. Two inference-based methods, two non-lattice-based methods, and a deep learning-based method are developed to identify potentially missing hierarchical (is-a) relations, erroneous is-a relations, and missing concepts in biomedical terminologies including the Gene Ontology (GO), the National Cancer Institute thesaurus (NCIt), and SNOMED CT.

    In the first inference-based method, GO concept names are represented with a set-of-words model and a sequence-of-words model. Inconsistencies derived between hierarchically linked and unlinked concept pairs are leveraged to detect potentially missing or erroneous is-a relations. The set-of-words model detects a total of 5,359 potential inconsistencies in the 03/28/2017 release of GO and the sequence-of-words model detects 4,959. Domain experts' evaluation shows that the set-of-words model achieves a precision of 53.78% (128 out of 238) and the sequence-of-words model a precision of 57.55% (122 out of 212) in identifying inconsistencies.

    In the second inference-based method, a Subsumption-based Sub-term Inference Framework (SSIF) is developed by introducing a novel term algebra on top of a sequence-based representation of GO concepts. This representation utilizes the part of speech of concept names, sub-concepts (concept names appearing inside another concept name), and antonyms appearing in concept names. Three conditional rules (the monotonicity, intersection, and sub-concept rules) are developed for backward subsumption inference. Applying SSIF to the 10/03/2018 release of GO suggests 1,938 potentially missing is-a relations. Domain experts' evaluation of 210 randomly selected potentially missing is-a relations shows that SSIF achieves a precision of 60.61%, 60.49%, and 46.03% for the monotonicity, intersection, and sub-concept rules, respectively.

    In the first non-lattice-based method, lexical patterns of concepts in Non-Lattice Subgraphs (NLSs: graph fragments with a higher tendency to contain quality issues) are mined to detect potentially missing is-a relations and missing concepts in NCIt. Six lexical patterns are leveraged: containment, union, intersection, union-intersection, inference-contradiction, and inference-union. Each pattern indicates a specific potential type of error and suggests a potential type of remediation. This method identifies 809 NLSs exhibiting these patterns in the 16.12d version of NCIt, achieving a precision of 66% (33 out of 50).

    In the second non-lattice-based method, enriched lexical attributes from concept ancestors are leveraged to identify potentially missing is-a relations in NLSs. The lexical attributes of a concept are inherited in two ways: from ancestors within the NLS, and from all ancestors. For a pair of concepts without a hierarchical relation, if the lexical attributes of one concept are a subset of those of the other, a potentially missing is-a relation between the two concepts is suggested. This method identifies a total of 1,022 potentially missing is-a relations in the 19.01d release of NCIt, with a precision of 84.44% (76 out of 90) when inheriting lexical attributes from ancestors within the NLS and 89.02% (73 out of 82) when inheriting from all ancestors.

    For the non-lattice-based methods, similar NLSs may contain similar quality issues, so exhaustive examination of NLSs would involve redundant work. A hybrid method is introduced to identify similar NLSs and avoid redundant analyses. Given an input NLS, a graph isomorphism algorithm is used to obtain its structurally identical NLSs. A similarity score between the input NLS and each of its structurally identical NLSs is computed based on the semantic similarity of their corresponding concept names: the concept names are converted to vectors using the Doc2Vec document embedding model and the cosine similarity of the two vectors is computed. All structurally identical NLSs with a similarity score above 0.85 are considered similar to the input NLS. Applying this method to 10 different structures of NLSs in the 02/12/2018 release of GO reveals that 38.43% of these NLSs have at least one similar NLS.

    Finally, a deep learning-based method is explored to facilitate the suggestion of missing is-a relations in NCIt and SNOMED CT, focusing on concept pairs exhibiting a containment pattern. The problem is framed as a binary classification task: given a pair of concepts, the deep learning model learns to predict whether the two concepts have an is-a relation. Positive training samples are existing is-a relations in the terminology that exhibit the containment pattern; negative training samples are concept pairs without is-a relations that also exhibit the containment pattern. A graph neural network model is constructed for this task and trained with subgraphs generated to enclose the concept pairs in the samples. To evaluate the model trained on each terminology, two evaluation sets are created using newer releases of each terminology as a partial reference standard. The model trained on NCIt achieves a precision of 0.5, a recall of 0.75, and an F1 score of 0.6; the model trained on SNOMED CT achieves a precision of 0.51, a recall of 0.64, and an F1 score of 0.56.
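
    As a minimal illustration of the enriched lexical-attribute idea described above, the sketch below suggests missing is-a relations for hierarchically unlinked concept pairs whose inherited word sets stand in a subset relation. The toy hierarchy, the concept names, and the helper functions are assumptions for the example, not the dissertation's actual data or code.

```python
# Minimal sketch, assuming toy data: the enriched lexical-attribute idea from
# the second non-lattice-based method. A concept's attributes are the words of
# its own name plus the words inherited from its ancestors; for a pair of
# concepts without a hierarchical relation, a subset relation between the two
# attribute sets suggests a potentially missing is-a edge.

from itertools import permutations

# Hypothetical toy hierarchy: concept name -> set of is-a parents.
PARENTS = {
    "regulation of cell migration": {"regulation of cell motility"},
    "positive regulation of cell migration": {"regulation of cell migration"},
    "regulation of epithelial cell migration": {"regulation of cell migration"},
    "positive regulation of epithelial cell migration": {"regulation of epithelial cell migration"},
}

def ancestors(concept):
    """All concepts reachable from `concept` through is-a edges."""
    seen, stack = set(), list(PARENTS.get(concept, ()))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(PARENTS.get(parent, ()))
    return seen

def lexical_attributes(concept):
    """Words of the concept name, enriched with words from all ancestors."""
    words = set(concept.split())
    for ancestor in ancestors(concept):
        words |= set(ancestor.split())
    return words

def suggest_missing_isa(concepts):
    """Unlinked pairs whose enriched attribute sets stand in a subset relation."""
    suggestions = []
    for c1, c2 in permutations(concepts, 2):
        if c2 in ancestors(c1):
            continue  # already hierarchically linked
        if lexical_attributes(c1) >= lexical_attributes(c2):
            suggestions.append((c1, c2))  # c1 looks more specific: suggest c1 is-a c2
    return suggestions

if __name__ == "__main__":
    for child, parent in suggest_missing_isa(list(PARENTS)):
        print(f"potentially missing is-a: {child!r} -> {parent!r}")
```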

    Impact of language skills and system experience on medical information retrieval


    Structural analysis and auditing of SNOMED hierarchies using abstraction networks

    SNOMED is one of the leading healthcare terminologies used worldwide. Due to its sheer volume and continuing expansion, it is inevitable that errors will make their way into SNOMED, so quality assurance is an important part of its maintenance cycle. A structural approach is presented in this dissertation, aimed at developing automated techniques that can help auditors discover terminology errors more effectively and efficiently. Large SNOMED hierarchies are partitioned, based primarily on their relationship patterns, into concept groups of more manageable sizes. Three related abstraction networks with respect to a SNOMED hierarchy, namely the area taxonomy, partial-area taxonomy, and disjoint partial-area taxonomy, are derived programmatically from the partitions. Together they afford high-level abstraction views of the underlying hierarchy, each with a different granularity. The area taxonomy gives a global structural view of a SNOMED hierarchy, while the partial-area taxonomy focuses more on the semantic uniformity and hierarchical proximity of concepts. The disjoint partial-area taxonomy is devised as an enhancement of the partial-area taxonomy and is based on partitioning the entire collection of so-called overlapping concepts into singly-rooted groups. The taxonomies are exploited as the basis for a number of systematic auditing regimens, with the theme that complex concepts are more error-prone and require special attention in auditing activities. In general, group-based auditing is promoted to achieve a more efficient review within semantically uniform groups. Certain concept groups in the different taxonomies are deemed "complex" according to various criteria and thus deserve focused auditing; examples include strict inheritance regions in the partial-area taxonomy and overlapping partial-areas in the disjoint partial-area taxonomy. Multiple hypotheses are formulated to characterize the error distributions and ratios with respect to the different concept groups presented by the taxonomies, and thus further establish their efficacy as vehicles for auditing. The methodologies are demonstrated using SNOMED's Specimen hierarchy as the test bed. Auditing results are reported and analyzed to assess the hypotheses. Using the double bootstrap and Fisher's exact test (two-tailed), the aforementioned hypotheses are confirmed: auditing of various complex concept groups based on the taxonomies is shown to yield a statistically significantly higher proportion of errors.
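
    To make the partitioning idea concrete, the following sketch groups a handful of made-up concepts into areas by their sets of attribute relationship types and identifies each area's partial-area roots. It is a minimal illustration under assumed toy data, not the dissertation's implementation.

```python
# Minimal sketch under assumed toy data, not the dissertation's implementation:
# partition concepts into "areas" keyed by their set of attribute relationship
# types, and find each area's partial-area roots (members none of whose is-a
# parents belong to the same area).

from collections import defaultdict

# Hypothetical concepts with is-a parents and attribute relationship types.
CONCEPTS = {
    "Specimen": {"parents": [], "rels": set()},
    "Tissue specimen": {"parents": ["Specimen"], "rels": {"specimen substance"}},
    "Fluid specimen": {"parents": ["Specimen"], "rels": {"specimen substance"}},
    "Specimen from lung": {"parents": ["Specimen"], "rels": {"specimen source topography"}},
    "Lung tissue specimen": {
        "parents": ["Tissue specimen", "Specimen from lung"],
        "rels": {"specimen substance", "specimen source topography"},
    },
}

def area_taxonomy(concepts):
    """Group concepts into areas by their relationship-type set."""
    areas = defaultdict(list)
    for name, info in concepts.items():
        areas[frozenset(info["rels"])].append(name)
    return areas

def partial_area_roots(area_members, concepts):
    """Members of an area none of whose parents lie in the same area."""
    members = set(area_members)
    return [c for c in area_members
            if not any(p in members for p in concepts[c]["parents"])]

if __name__ == "__main__":
    for rel_set, members in area_taxonomy(CONCEPTS).items():
        label = ", ".join(sorted(rel_set)) or "no attribute relationships"
        roots = partial_area_roots(members, CONCEPTS)
        print(f"Area [{label}]: members={members}, partial-area roots={roots}")
```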

    Mapping of electronic health records in Spanish to the Unified Medical Language System Metathesaurus

    This work presents a preliminary approach to annotating Spanish electronic health records with concepts from the Unified Medical Language System Metathesaurus. The prototype uses Apache Lucene to index the Metathesaurus and generate mapping candidates from input text, and relies on UKB to resolve ambiguities. The tool has been evaluated by measuring its agreement with MetaMap on two English-Spanish parallel corpora, one consisting of titles and abstracts of papers in the clinical domain and the other of excerpts from real electronic health records.
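
    As a rough illustration of the candidate-generation step, the sketch below builds a tiny token-overlap index over a few hypothetical Metathesaurus entries and ranks candidate concepts for a made-up Spanish sentence. It is a plain-Python stand-in for the Apache Lucene index the prototype actually uses; the CUIs and terms are invented for the example, and the spurious match on a stopword hints at why a disambiguation step such as UKB is needed.

```python
# Illustrative stand-in only: a tiny token-overlap index playing the role of
# the Apache Lucene index over UMLS Metathesaurus strings. The CUIs, Spanish
# terms, and the example sentence are invented for the sketch.

from collections import defaultdict

# Hypothetical (CUI, Spanish term) pairs taken to come from the Metathesaurus.
MRCONSO_ES = [
    ("C0011849", "diabetes mellitus"),
    ("C0020538", "hipertensión arterial"),
    ("C0027051", "infarto de miocardio"),
]

def build_index(entries):
    """Map each lowercased token to the entries whose term contains it."""
    index = defaultdict(set)
    for cui, term in entries:
        for token in term.lower().split():
            index[token].add((cui, term))
    return index

def mapping_candidates(text, index, min_overlap=1):
    """Rank candidate concepts by how many of their tokens occur in the text."""
    tokens = set(text.lower().split())
    scores = defaultdict(int)
    for token in tokens:
        for entry in index.get(token, ()):
            scores[entry] += 1
    ranked = sorted(scores.items(), key=lambda item: -item[1])
    return [(cui, term, score) for (cui, term), score in ranked if score >= min_overlap]

if __name__ == "__main__":
    index = build_index(MRCONSO_ES)
    sentence = "Paciente con antecedentes de diabetes mellitus e hipertensión"
    for cui, term, score in mapping_candidates(sentence, index):
        print(cui, term, score)
```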

    Structural indicators for effective quality assurance of SNOMED CT

    The Systematized Nomenclature of Medicine -- Clinical Terms (SNOMED CT -- further abbreviated as SCT) has been endorsed as a premier clinical terminology by many national and international organizations. The US Government has chosen SCT to play a significant role in its initiative to promote Electronic Health Record (EHR) use country-wide. However, there is evidence suggesting that, at the moment, SCT is not optimally modeled for its intended use by healthcare practitioners. There is a need to perform quality assurance (QA) of SCT to help expedite its use as a reference terminology for clinical purposes as planned for EHR use. The central theme of this dissertation is to define a group-based auditing methodology to effectively identify concepts of SCT that require QA. To this end, similarity sets are introduced, which are groups of concepts that are lexically identical except for one word. Concepts in a similarity set are expected to be modeled in a consistent way; if not, the set is considered inconsistent and submitted for review by an auditor. Initial studies found 38% of such sets to be inconsistent. The effectiveness of these sets is further improved through the use of three structural indicators. Using indicators such as the number of parents, relationships, and role groups, up to 70% of the similarity sets and 32.6% of the concepts are found to exhibit inconsistencies. Furthermore, positional similarity sets, which are similarity sets with the same position of the differing word in the concepts' terms, are introduced to improve the likelihood of finding errors at the concept level. This strictness in the position of the differing word increases the lexical similarity between the concepts of a set, thereby increasing the contrast between lexical similarity and modeling differences. This increase in contrast raises the likelihood of finding inconsistencies. The effectiveness of positional similarity sets in finding inconsistencies is further improved by using the same three structural indicators in the generation of these sets. An analysis of 50 sample sets with differences in the number of relationships reveals 41.6% of the concepts to be inconsistent. Moreover, a study is performed to fully automate the process of suggesting attributes to enhance the modeling of SCT concepts using positional similarity sets, together with a technique to automatically suggest the corresponding target values. An analysis of 50 sample concepts shows that, of the 103 suggested attributes, 67 are manually confirmed to be correct. Finally, a study is conducted to examine the readiness of the SCT problem list (PL) to support meaningful use of EHRs. The results show that the concepts in the PL suffer from the same issues as general SCT concepts, although to a slightly lesser extent, and do require further QA efforts. To support such efforts, structural indicators in the form of the number of parents and the number of words are shown to be effective in ferreting out potentially problematic concepts on which QA efforts should be focused. A structural indicator for finding concepts with synonymy problems is also presented, based on finding pairs of SCT concepts that map to the same UMLS concept.
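
    As a small illustration of how similarity sets and positional similarity sets can be generated, the sketch below groups concept names that differ in exactly one word by masking each word position in turn. The concept names are made up and the code is an assumed simplification rather than the dissertation's implementation; inconsistency checking would then compare structural indicators (for example, the number of parents, relationships, and role groups) across the members of each set.

```python
# Minimal sketch, not the dissertation's code: build (positional) similarity
# sets by masking one word position at a time and grouping concept names that
# become identical. The concept names below are made up.

from collections import defaultdict

CONCEPTS = [
    "Fracture of left femur",
    "Fracture of right femur",
    "Fracture of left tibia",
    "Dislocation of left femur",
]

def positional_similarity_sets(concepts):
    """Map (position, masked name) -> concepts differing only at that position."""
    groups = defaultdict(set)
    for name in concepts:
        words = name.split()
        for i in range(len(words)):
            masked = " ".join(words[:i] + ["*"] + words[i + 1:])
            groups[(i, masked)].add(name)
    # Keep only genuine similarity sets, i.e. groups with at least two members.
    return {key: members for key, members in groups.items() if len(members) > 1}

if __name__ == "__main__":
    for (position, masked), members in sorted(positional_similarity_sets(CONCEPTS).items()):
        print(f"differing word at position {position}: {masked} -> {sorted(members)}")
```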

    Computing Healthcare Quality Indicators Automatically: Secondary Use of Patient Data and Semantic Interoperability

    Harmelen, F.A.H. van [Promotor]; Keizer, N.F. de [Copromotor]; Cornet, R. [Copromotor]; Teije, A.C.M. [Copromotor]

    Contributions to information extraction for Spanish written biomedical text

    Healthcare practice and clinical research produce vast amounts of digitised, unstructured data in multiple languages that are currently underexploited, despite their potential applications in improving healthcare experiences, supporting trainee education, or enabling biomedical research, for example. To automatically transform those contents into relevant, structured information, advanced Natural Language Processing (NLP) mechanisms are required. In NLP, this task is known as Information Extraction. Our work takes place within this growing field of clinical NLP for the Spanish language, as we tackle three distinct problems. First, we compare several supervised machine learning approaches to the problem of sensitive data detection and classification; specifically, we study the different approaches and their transferability between two corpora, one synthetic and the other authentic. Second, we present and evaluate UMLSmapper, a knowledge-intensive system for biomedical term identification based on the UMLS Metathesaurus. This system recognises and codifies terms without relying on annotated data or external Named Entity Recognition tools. Although technically naive, it performs on par with more evolved systems and does not deviate considerably from approaches that rely on oracle terms. Finally, we present and exploit a new corpus of real health records manually annotated with negation and uncertainty information: NUBes. This corpus is the basis for two sets of experiments, one on cue and scope detection and the other on assertion classification. Throughout the thesis, we apply and compare techniques of varying levels of sophistication and novelty, which reflects the rapid advancement of the field.
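
    As a toy illustration of the negation cue and scope detection task addressed with the NUBes corpus, the sketch below applies a NegEx-style rule that marks the tokens following a negation cue, up to a scope terminator, as that cue's scope. The thesis itself uses supervised machine learning; the cue lexicon, terminators, and example sentence here are assumptions made only to show what the task involves.

```python
# Toy illustration of the task only: a NegEx-style rule marks the tokens after
# a negation cue, up to a scope terminator, as that cue's scope. The thesis
# itself trains machine learning models on NUBes; the cue lexicon, terminators,
# and example sentence here are assumptions made for the sketch.

NEGATION_CUES = {"sin", "no", "niega"}    # hypothetical cue lexicon
SCOPE_TERMINATORS = {".", ",", "pero"}    # hypothetical scope breakers

def detect_cues_and_scopes(tokens):
    """Return (cue index, scope token indices) for each negation cue found."""
    results = []
    for i, token in enumerate(tokens):
        if token.lower() in NEGATION_CUES:
            scope = []
            for j in range(i + 1, len(tokens)):
                if tokens[j].lower() in SCOPE_TERMINATORS:
                    break
                scope.append(j)
            results.append((i, scope))
    return results

if __name__ == "__main__":
    sentence = "Paciente sin fiebre ni tos , pero refiere dolor torácico".split()
    for cue, scope in detect_cues_and_scopes(sentence):
        print("cue:", sentence[cue], "| scope:", " ".join(sentence[k] for k in scope))
```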

    Automatic processing of clinical narrative text (Processamento automático de texto de narrativas clínicas)

    The informatization of medical systems and the subsequent move by medical professionals towards Electronic Health Records (EHRs) over the paper format has allowed for safer and more efficient healthcare. Additionally, EHRs can also be used as a data source for observational studies around the world. However, it is estimated that 70-80% of all clinical data is in the form of unstructured free text, and the data that is structured does not all follow the same standards, making it difficult to use in such observational studies. This dissertation aims to tackle those two adversities by using natural language processing to extract concepts from free text and, afterwards, a common data model to harmonize the data. The developed system employs an annotator, namely cTAKES, to extract concepts from free text. The extracted concepts are then normalized using text preprocessing, word embeddings, MetaMap, and UMLS Metathesaurus lookup. Finally, the normalized concepts are converted to the OMOP Common Data Model and stored in a database. To test the developed system, the i2b2 2010 data set was used. The different components of the system were tested and evaluated separately, with the concept extraction component achieving a precision, recall, and F-score of 77.12%, 70.29%, and 73.55%, respectively. The normalization component was evaluated on track 3 of the N2C2 2019 challenge, where it achieved 77.5% accuracy. Finally, in the OMOP CDM conversion component, 7.92% of the concepts were lost during the process. In conclusion, even though the developed system still has room for improvement, it proves to be a viable method for automatically processing clinical narratives.
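
    As a heavily simplified illustration of the final conversion step, the sketch below maps normalized UMLS CUIs to OMOP standard concept_ids and stores them in a condition_occurrence-like table, counting the mentions that cannot be mapped. The CUI-to-concept_id mapping, the reduced schema, and the sample mentions are all assumed for the example; they are not the dissertation's actual tables or data.

```python
# Heavily simplified illustration, not the dissertation's pipeline: map
# normalized UMLS CUIs to OMOP standard concept_ids and store them in a
# condition_occurrence-like table, counting mentions lost for lack of a
# mapping. The mapping, the reduced schema, and the sample data are assumed.

import sqlite3

# Hypothetical UMLS CUI -> OMOP standard concept_id mapping.
CUI_TO_OMOP = {"C0011849": 201826, "C0020538": 316866}

normalized_mentions = [
    {"person_id": 1, "cui": "C0011849", "start_date": "2010-02-19"},
    {"person_id": 1, "cui": "C0020538", "start_date": "2010-02-19"},
    {"person_id": 2, "cui": "C9999999", "start_date": "2010-03-01"},  # unmappable
]

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE condition_occurrence (
           person_id INTEGER,
           condition_concept_id INTEGER,
           condition_start_date TEXT
       )"""
)

lost = 0
for mention in normalized_mentions:
    concept_id = CUI_TO_OMOP.get(mention["cui"])
    if concept_id is None:
        lost += 1  # concepts without an OMOP mapping are lost in the conversion
        continue
    conn.execute(
        "INSERT INTO condition_occurrence VALUES (?, ?, ?)",
        (mention["person_id"], concept_id, mention["start_date"]),
    )

stored = conn.execute("SELECT COUNT(*) FROM condition_occurrence").fetchone()[0]
print(f"stored rows: {stored}, lost during conversion: {lost}")
```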