7 research outputs found

    Preliminary results using Snorocket to detect errors in the post-coordination of SNOMED CT

    We present preliminary results from applying a procedure that detects and corrects errors in the concept definitions of a local interface vocabulary that uses SNOMED CT as its reference vocabulary. Using the relations inferred by Snorocket, we detected redundant fully defined concepts as well as suspect patterns in which concepts had redundant inferred relations. Our procedure detected errors in 1.63% of the whole vocabulary; the primary type of error was duplication, since these concepts did not exist when the knowledge modeler asserted them. Using these results, we implemented a GUI to track patterns and correct errors. Our procedure contributes to the quality assurance of our local interface vocabulary, since errors in the hierarchies can compromise interoperability and meaningful use of the vocabulary. Our approach could be used by thesaurus implementers to detect suspect patterns, group them, and offer a centralized interface for correcting them. VIII Workshop Innovación en Sistemas de Software (WISS). Red de Universidades con Carreras en Informática (RedUNCI)

    Auditing description-logic-based medical terminological systems by detecting equivalent concept definitions

    OBJECTIVE: To specify and evaluate a method for auditing medical terminological systems (TSs) based on detecting concepts with equivalent definitions. This method addresses two important problems: redundancy, where the same concept is represented more than once (described by different terms), and underspecification, where different concepts have the same representation and hence appear indistinguishable from each other. DESIGN: The auditing method is applicable for TSs that are or can be represented in a description logic (DL). The method relies on the assumption that concept definitions are non-primitive (i.e. they are regarded as providing necessary and sufficient conditions). Whereas this assumption may not hold for many definitions, it does serve the purpose of detecting sets of logically equivalent concepts by a DL reasoner. Such a set may include the same concept which is defined more than once and/or different concepts that are underspecified as they appear indistinguishable from each other by their represented properties. Analysis of these sets provides insight into the representation quality of concepts and provides hints at improving the TS. MEASUREMENTS: In our case study the method is applied to the DICE TS, a comprehensive TS in intensive care. It comprises about 2500 concepts and 40 properties and relations. RESULTS: In DICE we found four concepts that were defined twice. Furthermore, 100 sets were found containing more than 300 underspecified concepts. The sizes of these sets ranged from 2 to 13. Analysis revealed that many concepts can be more completely defined, either by adding existing relations, or by the introduction of new relations into the terminological system. CONCLUSION: The method proved both usable and valuable for auditing TSs. DL reasoning is fully automated and all equivalent concept definitions are systematically found. 
The resulting sets of equivalent concepts clearly point out which concept definitions are to be reviewed, as they contain duplicate definitions of a concept, and (inherently or unnecessarily) underspecified concept
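The idea of collecting concepts with equivalent definitions into sets for review can be illustrated with a minimal sketch. It is not the method of the abstract: a DL reasoner detects logical equivalence, whereas this hypothetical simplification only groups concepts whose stated (relation, value) pairs are identical. The function name and the toy data are assumptions made for illustration and are not taken from DICE.

```python
from collections import defaultdict

def equivalent_definition_sets(definitions):
    """Group concepts whose stated definitions are identical.

    `definitions` maps a concept name to an iterable of (relation, value)
    pairs. Assumption: syntactic identity of the pair sets stands in for
    the logical equivalence that a DL reasoner would compute.
    """
    groups = defaultdict(list)
    for concept, pairs in definitions.items():
        # frozenset ignores the order in which pairs were asserted
        groups[frozenset(pairs)].append(concept)
    # Only sets with two or more members need review
    return [sorted(members) for members in groups.values() if len(members) > 1]

# Made-up toy data for illustration only
toy = {
    "Bacterial pneumonia": [("is_a", "Pneumonia"), ("cause", "Bacterium")],
    "Pneumonia due to bacteria": [("cause", "Bacterium"), ("is_a", "Pneumonia")],
    "Viral pneumonia": [("is_a", "Pneumonia"), ("cause", "Virus")],
}
sets_found = equivalent_definition_sets(toy)
print(sets_found)
```

A set found this way may hold a duplicate (the same concept defined twice) or distinct but underspecified concepts; telling the two apart still requires manual review.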

    Structural indicators for effective quality assurance of SNOMED CT

    The Systematized Nomenclature of Medicine, Clinical Terms (SNOMED CT, hereafter abbreviated as SCT) has been endorsed as a premier clinical terminology by many national and international organizations. The US Government has chosen SCT to play a significant role in its initiative to promote Electronic Health Records (EHR) country-wide. However, there is evidence suggesting that, at the moment, SCT is not optimally modeled for its intended use by healthcare practitioners. There is a need to perform quality assurance (QA) of SCT to help expedite its use as a reference terminology for clinical purposes as planned for EHR use. The central theme of this dissertation is to define a group-based auditing methodology to effectively identify concepts of SCT that require QA. As such, similarity sets are introduced, which are groups of concepts that are lexically identical except for one word. Concepts in a similarity set are expected to be modeled in a consistent way. If not, the set is considered to be inconsistent and submitted for review by an auditor. Initial studies found 38% of such sets to be inconsistent. The effectiveness of these sets is further improved through the use of three structural indicators. Using such indicators as the number of parents, relationships and role groups, up to 70% of the similarity sets and 32.6% of the concepts are found to exhibit inconsistencies. Furthermore, positional similarity sets, which are similarity sets with the same position of the differing word in the concept’s terms, are introduced to improve the likelihood of finding errors at the concept level. This strictness in the position of the differing word increases the lexical similarity between the concepts of a set, thereby increasing the contrast between lexical similarities and modeling differences. This increase in contrast increases the likelihood of finding inconsistencies.
The effectiveness of positional similarity sets in finding inconsistencies is further improved by using the same three structural indicators as discussed above in the generation of these sets. An analysis of 50 sample sets with differences in the number of relationships reveals 41.6% of the concepts to be inconsistent. Moreover, a study is performed to fully automate the process of suggesting attributes to enhance the modeling of SCT concepts using positional similarity sets. A technique is also used to automatically suggest the corresponding target values. An analysis of 50 sample concepts shows that, of the 103 suggested attributes, 67 are manually confirmed to be correct. Finally, a study is conducted to examine the readiness of the SCT problem list (PL) to support meaningful use of EHR. The results show that the concepts in PL suffer from the same issues as general SCT concepts, although to a slightly lesser extent, and do require further QA efforts. To support such efforts, structural indicators in the form of the number of parents and the number of words are shown to be effective in ferreting out potentially problematic concepts on which QA efforts should be focused. A structural indicator to find concepts with synonymy problems is also presented by finding pairs of SCT concepts that map to the same UMLS concept
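The definition of positional similarity sets given above (terms that are identical except for one word, with the differing word in the same position) can be sketched as follows. This is an illustrative reading of the definition, not the dissertation's implementation; the function name, the lowercase whitespace tokenization, and the toy terms are assumptions.

```python
from collections import defaultdict

def positional_similarity_sets(terms):
    """Group terms that differ in exactly one word at the same position.

    Assumption: terms are tokenized by lowercasing and splitting on
    whitespace. Each term is indexed once per word position, with that
    word masked out; terms sharing a masked key form a set.
    """
    groups = defaultdict(set)
    for term in terms:
        words = term.lower().split()
        for i in range(len(words)):
            # Key = position of the masked word plus the words around it
            key = (i, tuple(words[:i]), tuple(words[i + 1:]))
            groups[key].add(term)
    return sorted(sorted(group) for group in groups.values() if len(group) > 1)

# Made-up toy terms for illustration only
toy_terms = [
    "Fracture of left femur",
    "Fracture of right femur",
    "Fracture of left tibia",
    "Open wound of hand",
]
similarity_sets = positional_similarity_sets(toy_terms)
for group in similarity_sets:
    print(group)
```

Each resulting set could then be checked with the structural indicators named in the abstract (number of parents, relationships, and role groups); a set whose members disagree on one of them would be a candidate for review.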

    Structural analysis and auditing of SNOMED hierarchies using abstraction networks

    SNOMED is one of the leading healthcare terminologies being used worldwide. Due to its sheer volume and continuing expansion, it is inevitable that errors will make their way into SNOMED. Thus, quality assurance is an important part of its maintenance cycle. A structural approach is presented in this dissertation, aiming at developing automated techniques that can aid auditors in the discovery of terminology errors more effectively and efficiently. Large SNOMED hierarchies are partitioned, based primarily on their relationship patterns, into concept groups of more manageable sizes. Three related abstraction networks with respect to a SNOMED hierarchy, namely the area taxonomy, partial-area taxonomy, and disjoint partial-area taxonomy, are derived programmatically from the partitions. Altogether they afford high-level abstraction views of the underlying hierarchy, each with different granularity. The area taxonomy gives a global structural view of a SNOMED hierarchy, while the partial-area taxonomy focuses more on the semantic uniformity and hierarchical proximity of concepts. The disjoint partial-area taxonomy is devised as an enhancement of the partial-area taxonomy and is based on the partition of the entire collection of so-called overlapping concepts into singly-rooted groups. The taxonomies are exploited as the basis for a number of systematic auditing regimens, with a theme that complex concepts are more error-prone and require special attention in auditing activities. In general, group-based auditing is promoted to achieve a more efficient review within semantically uniform groups. Certain concept groups in the different taxonomies are deemed “complex” according to various criteria and thus deserve focused auditing. Examples of these include strict inheritance regions in the partial-area taxonomy and overlapping partial-areas in the disjoint partial-area taxonomy.
Multiple hypotheses are formulated to characterize the error distributions and ratios with respect to different concept groups presented by the taxonomies, and thus further establish their efficacy as vehicles for auditing. The methodologies are demonstrated using SNOMED’s Specimen hierarchy as the test bed. Auditing results are reported and analyzed to assess the hypotheses. With the use of the double bootstrap and Fisher’s exact test (two-tailed), the aforementioned hypotheses are confirmed. Auditing on various complex concept groups based on the taxonomies is shown to yield a statistically significantly higher proportion of errors

    Structural and lexical methods for auditing biomedical terminologies

    Biomedical terminologies serve as knowledge sources for a wide variety of biomedical applications including information extraction and retrieval, data integration and management, and decision support. Quality issues of biomedical terminologies, if not addressed, could affect all downstream applications that use them as knowledge sources. Therefore, Terminology Quality Assurance (TQA) has become an integral part of the terminology management lifecycle. However, identification of potential quality issues is challenging due to the ever-growing size and complexity of biomedical terminologies. It is time-consuming and labor-intensive to manually audit them and hence, automated TQA methods are highly desirable. In this dissertation, systematic and scalable methods to audit biomedical terminologies utilizing their structural as well as lexical information are proposed. Two inference-based methods, two non-lattice-based methods and a deep learning-based method are developed to identify potentially missing hierarchical (or is-a) relations, erroneous is-a relations, and missing concepts in biomedical terminologies including the Gene Ontology (GO), the National Cancer Institute thesaurus (NCIt), and SNOMED CT. In the first inference-based method, the GO concept names are represented using a set-of-words model and a sequence-of-words model, respectively. Inconsistencies derived between hierarchically linked and unlinked concept pairs are leveraged to detect potentially missing or erroneous is-a relations. The set-of-words model detects a total of 5,359 potential inconsistencies in the 03/28/2017 release of GO and the sequence-of-words model detects 4,959. Domain experts’ evaluation shows that the set-of-words model achieves a precision of 53.78% (128 out of 238) and the sequence-of-words model achieves a precision of 57.55% (122 out of 212) in identifying inconsistencies.
In the second inference-based method, a Subsumption-based Sub-term Inference Framework (SSIF) is developed by introducing a novel term-algebra on top of a sequence-based representation of GO concepts. The sequence-based representation utilizes the part of speech of concept names, sub-concepts (concept names appearing inside another concept name), and antonyms appearing in concept names. Three conditional rules (monotonicity, intersection, and sub-concept rules) are developed for backward subsumption inference. Applying SSIF to the 10/03/2018 release of GO suggests 1,938 potentially missing is-a relations. Domain experts’ evaluation of 210 randomly selected potentially missing is-a relations shows that SSIF achieves a precision of 60.61%, 60.49%, and 46.03% for the monotonicity, intersection, and sub-concept rules, respectively. In the first non-lattice-based method, lexical patterns of concepts in Non-Lattice Subgraphs (NLSs: graph fragments with a higher tendency to contain quality issues) are mined to detect potentially missing is-a relations and missing concepts in NCIt. Six lexical patterns are leveraged: containment, union, intersection, union-intersection, inference-contradiction, and inference-union. Each pattern indicates a potential specific type of error and suggests a potential type of remediation. This method identifies 809 NLSs exhibiting these patterns in the 16.12d version of NCIt, achieving a precision of 66% (33 out of 50). In the second non-lattice-based method, enriched lexical attributes from concept ancestors are leveraged to identify potentially missing is-a relations in NLSs. The lexical attributes of a concept are inherited in two ways: from ancestors within the NLS, and from all the ancestors. For a pair of concepts without a hierarchical relation, if the lexical attributes of one concept are a subset of those of the other, a potentially missing is-a relation between the two concepts is suggested.
This method identifies a total of 1,022 potentially missing is-a relations in the 19.01d release of NCIt with a precision of 84.44% (76 out of 90) for inheriting lexical attributes from ancestors within the NLS and 89.02% (73 out of 82) for inheriting from all the ancestors. For the non-lattice-based methods, similar NLSs may contain similar quality issues, and thus exhaustive examination of NLSs would involve redundant work. A hybrid method is introduced to identify similar NLSs to avoid redundant analyses. Given an input NLS, a graph isomorphism algorithm is used to obtain its structurally identical NLSs. A similarity score between the input NLS and each of its structurally identical NLSs is computed based on semantic similarity between their corresponding concept names. To compute the similarity between concept names, the concept names are converted to vectors using the Doc2Vec document embedding model and then the cosine similarity of the two vectors is computed. All the structurally identical NLSs with a similarity score above 0.85 are considered to be similar to the input NLS. Applying this method to 10 different structures of NLSs in the 02/12/2018 release of GO reveals that 38.43% of these NLSs have at least one similar NLS. Finally, a deep learning-based method is explored to facilitate the suggestion of missing is-a relations in NCIt and SNOMED CT. Concept pairs exhibiting a containment pattern are the focus here. The problem is framed as a binary classification task, where given a pair of concepts, the deep learning model learns to predict whether the two concepts have an is-a relation or not. Positive training samples are existing is-a relations in the terminology exhibiting a containment pattern. Negative training samples are concept pairs without is-a relations that also exhibit a containment pattern. A graph neural network model is constructed for this task and trained with subgraphs generated enclosing the pairs of concepts in the samples.
To evaluate the models trained on the two terminologies, two evaluation sets are created considering newer releases of each terminology as a partial reference standard. The model trained on NCIt achieves a precision of 0.5, a recall of 0.75, and an F1 score of 0.6. The model trained on SNOMED CT achieves a precision of 0.51, a recall of 0.64, and an F1 score of 0.56.
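The subset test of the second non-lattice-based method described above can be sketched in a few lines. This is a hedged illustration rather than the dissertation's implementation. It makes the following assumptions: lexical attributes are the lowercase words of a concept name plus those of all its ancestors; the restriction to non-lattice subgraphs is ignored; a proper subset is required; and the concept with the larger attribute set is taken to be the more specific one. All function names and the toy hierarchy are hypothetical.

```python
def ancestors(concept, parents):
    """Return all ancestors of `concept` in a child -> parents mapping."""
    seen, stack = set(), list(parents.get(concept, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen

def lexical_attributes(concept, parents):
    """Words of the concept name, enriched with words of all ancestors."""
    words = set(concept.lower().split())
    for ancestor in ancestors(concept, parents):
        words |= set(ancestor.lower().split())
    return words

def suggest_missing_is_a(concepts, parents):
    """Suggest (child, parent) pairs that have no hierarchical relation
    but where the parent's attributes are a proper subset of the child's."""
    attrs = {c: lexical_attributes(c, parents) for c in concepts}
    ancs = {c: ancestors(c, parents) for c in concepts}
    suggestions = []
    for child in concepts:
        for parent in concepts:
            if child == parent or parent in ancs[child] or child in ancs[parent]:
                continue  # identical or already hierarchically related
            if attrs[parent] < attrs[child]:
                suggestions.append((child, parent))
    return sorted(suggestions)

# Made-up toy hierarchy; the is-a link from "Malignant lung neoplasm"
# to "Lung neoplasm" is deliberately left out.
toy_parents = {
    "Neoplasm": [],
    "Lung neoplasm": ["Neoplasm"],
    "Malignant lung neoplasm": ["Neoplasm"],
}
suggested = suggest_missing_is_a(list(toy_parents), toy_parents)
print(suggested)
```

As in the abstract, such output is only a list of candidates; the reported precision figures come from domain experts reviewing the suggestions.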