106 research outputs found

    Abstraction, extension and structural auditing with the UMLS semantic network

    Get PDF
    The Unified Medical Language System (UMLS) is a two-level biomedical terminological knowledge base, consisting of the Metathesaurus (META) and the Semantic Network (SN), which is an upper-level ontology of broad categories called semantic types (STs). The two levels are related via assignments of one or more STs to each concept of the META. Although the SN provides a high-level abstraction for the META, it is not compact enough. Various metaschemas, which are compact higher-level abstraction networks of the SN, have been derived. A methodology is presented to evaluate and compare two given metaschemas, based on their structural properties. A consolidation algorithm is designed to yield a consolidated metaschema maintaining the best and avoiding the worst of the two given metaschemas. The methodology and consolidation algorithm were applied to the pair of heuristic metaschemas, the top-down metaschema and the bottom-up metaschema, which have been derived from two studies involving two groups of UMLS experts. The results show that the consolidated metaschema has better structural properties than either of the two input metaschemas. Better structural properties are expected to lead to better utilization of a metaschema in orientation and visualization of the SN. Repetitive consolidation, which leads to further structural improvements, is also shown. The META and SN were created in the absence of a comprehensive curated genomics terminology. The internal consistency of the SN\u27s categories which are relevant to genomics is evaluated and changes to improve its ability to express genomic knowledge are proposed. The completeness of the SN with respect to genomic concepts is evaluated and conesponding extensions to the SN to fill identified gaps are proposed. Due to the size and complexity of the UMLS, errors are inevitable. A group auditing methodolgy is presented, where the ST assignments for groups of similar concepts are audited. The extent of an ST, which is the group of all concepts assigned this ST, is divided into groups of concepts that have been assigned exactly the same set of STs. An algorithm finds subgroups of suspicious concepts. The auditor is presented with these subgroups, which purportedly exhibit the same semantics, and thus he will notice different concepts with wrong or missing ST assignments. Another methodology partitions these groups into smaller, singly rooted, hierarchically organized sets used to audit the hierarchical relationships. The algorithmic methodologies are compared with a comprehensive manual audit and show a very high error recall with a much higher precision than the manual exhaustive review

    Structural auditing methodologies for controlled terminologies

    Get PDF
    Several auditing methodologies for large controlled terminologies are developed. These are applied to the Unified Medical Language System XXXX and the National Cancer Institute Thesaurus (NCIT). Structural auditing methodologies are based on the structural aspects such as IS-A hierarchy relationships groups of concepts assigned to semantic types and groups of relationships defined for concepts. Structurally uniform groups of concepts tend to be semantically uniform. Structural auditing methodologies focus on concepts with unlikely or rare configuration. These concepts have a high likelihood for errors. One of the methodologies is based on comparing hierarchical relationships between the META and SN, two major knowledge sources of the UMLS. In general, a correspondence between them is expected since the SN hierarchical relationships should abstract the META hierarchical relationships. It may indicate an error when a mismatch occurs. The UMLS SN has 135 categories called semantic types. However, in spite of its medium size, the SN has limited use for comprehension purposes because it cannot be easily represented in a pictorial form, it has many (about 7,000) relationships. Therefore, a higher-level abstraction for the SN called a metaschema, is constructed. Its nodes are meta-semantic types, each representing a connected group of semantic types of the SN. One of the auditing methodologies is based on a kind of metaschema called a cohesive metaschema. The focus is placed on concepts of intersections of meta-semantic types. As is shown, such concepts have high likelihood for errors. Another auditing methodology is based on dividing the NCIT into areas according to the roles of its concepts. Moreover, each multi-rooted area is further divided into pareas that are singly rooted. Each p-area contains a group of structurally and semantically uniform concepts. These groups, as well as two derived abstraction networks called taxonomies, help in focusing on concepts with potential errors. With genomic research being at the forefront of bioscience, this auditing methodology is applied to the Gene hierarchy as well as the Biological Process hierarchy of the NCIT, since processes are very important for gene information. The results support the hypothesis that the occurrence of errors is related to the size of p-areas. Errors are more frequent for small p-areas

    Extensions of SNOMED taxonomy abstraction networks supporting auditing and complexity analysis

    Get PDF
    The Systematized Nomenclature of Medicine – Clinical Terms (SNOMED CT) has been widely used as a standard terminology in various biomedical domains. The enhancement of the quality of SNOMED contributes to the improvement of the medical systems that it supports. In previous work, the Structural Analysis of Biomedical Ontologies Center (SABOC) team has defined the partial-area taxonomy, a hierarchical abstraction network consisting of units called partial-areas. Each partial-area comprises a set of SNOMED concepts exhibiting a particular relationship structure and being distinguished by a unique root concept. In this dissertation, some extensions and applications of the taxonomy framework are considered. Some concepts appearing in multiple partial-areas have been designated as complex due to the fact that they constitute a tangled portion of a hierarchy and can be obstacles to users trying to gain an understanding of the hierarchy’s content. A methodology for partitioning the entire collection of these so-called overlapping complex concepts into singly-rooted groups was presented. A novel auditing methodology based on an enhanced abstraction network is described. In addition, the existing abstraction network relies heavily on the structure of the outgoing relationships of the concepts. But some of SNOMED hierarchies (or subhierarchies) serve only as targets of relationships, with few or no outgoing relationships of their own. This situation impedes the applicability of the abstraction network. To deal with this problem, a variation of the above abstraction network, called the converse abstraction network (CAN) is defined and derived automatically from a given SNOMED hierarchy. An auditing methodology based on the CAN is formulated. Furthermore, a preliminary study of the complementary use of the abstraction network in description logic (DL) for quality assurance purposes pertaining to SNOMED is presented. Two complexity measures, a structural complexity measure and a hierarchical complexity measure, based on the abstraction network are introduced to quantify the complexity of a SNOMED hierarchy. An extension of the two measures is also utilized specifically to track the complexity of the versions of the SNOMED hierarchies before and after a sequence of auditing processes

    Structural indicators for effective quality assurance of snomed ct

    Get PDF
    The Standardized Nomenclature of Medicine -- Clinical Terms (SNOMED CT -- further abbreviated as SCT) has been endorsed as a premier clinical terminology by many national and international organizations. The US Government has chosen SCT to play a significant role in its initiative to promote Electronic Health Record (EH R) country-wide. However, there is evidence suggesting that, at the moment, SCT is not optimally modeled for its intended use by healthcare practitioners. There is a need to perform quality assurance (QA) of SCT to help expedite its use as a reference terminology for clinical purposes as planned for EH R use. The central theme of this dissertation is to define a group-based auditing methodology to effectively identify concepts of SCT that require QA. As such, similarity sets are introduced which are groups of concepts that are lexically identical except for one word. Concepts in a similarity set are expected to be modeled in a consistent way. If not, the set is considered to be inconsistent and submitted for review by an auditor. Initial studies found 38% of such sets to be inconsistent. The effectiveness of these sets is further improved through the use of three structural indicators. Using such indicators as the number of parents, relationships and role groups, up to 70% of the similarity sets and 32.6% of the concepts are found to exhibit inconsistencies. Furthermore, positional similarity sets, which are similarity sets with the same position of the differing word in the concept’s terms, are introduced to improve the likelihood of finding errors at the concept level. This strictness in the position of the differing word increases the lexical similarity between the concepts of a set thereby increasing the contrast between lexical similarities and modeling differences. This increase in contrast increases the likelihood of finding inconsistencies. The effectiveness of positional similarity sets in finding inconsistencies is further improved by using the same three structural indicators as discussed above in the generation of these sets. An analysis of 50 sample sets with differences in the number of relationships reveal 41.6% of the concepts to be inconsistent. Moreover, a study is performed to fully automate the process of suggesting attributes to enhance the modeling of SCT concepts using positional similarity sets. A technique is also used to automatically suggest the corresponding target values. An analysis of 50 sample concepts show that, of the 103 suggested attributes, 67 are manually confirmed to be correct. Finally, a study is conducted to examine the readiness of SCT problem list (PL) to support meaningful use of EHR. The results show that the concepts in PL suffer from the same issues as general SCT concepts, although to a slightly lesser extent, and do require further QA efforts. To support such efforts, structural indicators in the form of the number of parents and the number of words are shown to be effective in ferreting out potentially problematic concepts in which QA efforts should be focused. A structural indicator to find concepts with synonymy problems is also presented by finding pairs of SCT concepts that map to the same UMLS concept

    Scalable Approaches for Auditing the Completeness of Biomedical Ontologies

    Get PDF
    An ontology provides a formalized representation of knowledge within a domain. In biomedicine, ontologies have been widely used in modern biomedical applications to enable semantic interoperability and facilitate data exchange. Given the important roles that biomedical ontologies play, quality issues such as incompleteness, if not addressed, can affect the quality of downstream ontology-driven applications. However, biomedical ontologies often have large sizes and complex structures. Thus, it is infeasible to uncover potential quality issues through manual effort. In this dissertation, we introduce automated and scalable approaches for auditing the completeness of biomedical ontologies. We mainly focus on two incompleteness issues -- missing hierarchical relations and missing concepts. To identify missing hierarchical relations, we develop three approaches: a lexical-based approach, a hybrid approach utilizing both lexical features and logical definitions, and an approach based on concept name transformation. To identify missing concepts, a lexical-based Formal Concept Analysis (FCA) method is proposed for concept enrichment. We also predict proper concept names for the missing concepts using deep learning techniques. Manual review by domain experts is performed to evaluate these approaches. In addition, we leverage extrinsic knowledge (i.e., external ontologies) to help validate the detected incompleteness issues. The auditing approaches have been applied to a variety of biomedical ontologies, including the SNOMED CT, National Cancer Institute (NCI) Thesaurus and Gene Ontology. In the first lexical-based approach to identify missing hierarchical relations, each concept is modeled with an enriched set of lexical features, leveraging words and noun phrases in the name of the concept itself and the concept\u27s ancestors. Given a pair of concepts that are not linked by a hierarchical relation, if the enriched lexical attributes of one concept is a superset of the other\u27s, a potentially missing hierarchical relation will be suggested. Applying this approach to the September 2017 release of SNOMED CT (US edition) suggested 38,615 potentially missing hierarchical relations. A domain expert reviewed a random sample of 100 potentially missing ones, and confirmed 90 are valid (a precision of 90%). In the second work, a hybrid approach is proposed to detect missing hierarchical relations in non-lattice subgraphs. For each concept, its lexical features are harmonized with role definitions to provide a more comprehensive semantic model. Then a two-step subsumption testing is performed to automatically suggest potentially missing hierarchical relations. This approach identified 55 potentially missing hierarchical relations in the 19.08d version of the NCI Thesaurus. 29 out of 55 were confirmed as valid by the curators from the NCI Enterprise Vocabulary Service (EVS) and have been incorporated in the newer versions of the NCI Thesaurus. 7 out of 55 further revealed incorrect existing hierarchical relations in the NCI Thesaurus. In the third work, we introduce a transformation-based method that leverages the Unified Medical Language System (UMLS) knowledge to identify missing hierarchical relations in its source ontologies. Given a concept name, noun chunks within it are identified and replaced by their more general counterparts to generate new concept names that are supposed to be more general than the original one. Applying this method to the UMLS (2019AB release), a total of 39,359 potentially missing hierarchical relations were detected in 13 source ontologies. Domain experts evaluated a random sample of 200 potentially missing hierarchical relations identified in the SNOMED CT (US edition), and 100 in the Gene Ontology. 173 out of 200 and 63 out of 100 potentially missing hierarchical relations were confirmed by domain experts, indicating our method achieved a precision of 86.5% and 63% for the SNOMED CT and Gene Ontology, respectively. In the work of concept enrichment, we introduce a lexical method based on FCA to identify potentially missing concepts. Lexical features (i.e., words appearing in the concept names) are considered as FCA attributes while generating formal context. Applying multistage intersection on FCA attributes results in newly formalized concepts along with bags of words that can be utilized to name the concepts. This method was applied to the Disease or Disorder sub-hierarchy in the 19.08d version of the NCI Thesaurus and identified 8,983 potentially missing concepts. We performed a preliminary evaluation and validated that 592 out of 8,983 potentially missing concepts were included in external ontologies in the UMLS. After obtaining new concepts and their relevant bags of words, we further developed deep learning-based approaches to automatically predict concept names that comply with the naming convention of a specific ontology. We explored simple neural network, Long Short-Term Memory (LSTM), and Convolutional Neural Network (CNN) combined with LSTM. Our experiments showed that the LSTM-based approach achieved the best performance with an F1 score of 63.41% for predicting names for newly added concepts in the March 2018 release of SNOMED CT (US Edition) and an F1 score of 73.95% for naming missing concepts revealed by our previous work. In the last part of this dissertation, extrinsic knowledge is leveraged to collect supporting evidence for the detected incompleteness issues. We present a work in which cross-ontology evaluation based on extrinsic knowledge from the UMLS is utilized to help validate potentially missing hierarchical relations, aiming at relieving the heavy workload of manual review

    Structural analysis and auditing of SNOMED hierarchies using abstraction networks

    Get PDF
    SNOMED is one of the leading healthcare terminologies being used worldwide. Due to its sheer volume and continuing expansion, it is inevitable that errors will make their way into SNOMED. Thus, quality assurance is an important part of its maintenance cycle. A structural approach is presented in this dissertation, aiming at developing automated techniques that can aid auditors in the discovery of terminology errors more effectively and efficiently. Large SNOMED hierarchies are partitioned, based primarily on their relationships patterns, into concept groups of more manageable sizes. Three related abstraction networks with respect to a SNOMED hierarchy, namely the area taxonomy, partial-area taxonomy, and disjoint partial-area taxonomy, are derived programmatically from the partitions. Altogether they afford high-level abstraction views of the underlying hierarchy, each with different granularity. The area taxonomy gives a global structural view of a SNOMED hierarchy, while the partial-area taxonomy focuses more on the semantic uniformity and hierarchical proximity of concepts. The disjoint partial-area taxonomy is devised as an enhancement of the partial-area taxonomy and is based on the partition of the entire collection of so-called overlapping concepts into singly-rooted groups. The taxonomies are exploited as the basis for a number of systematic auditing regimens, with a theme that complex concepts are more error-prone and require special attention in auditing activities. In general, group-based auditing is promoted to achieve a more efficient review within semantically uniform groups. Certain concept groups in the different taxonomies are deemed “complex” according to various criteria and thus deserve focused auditing. Examples of these include strict inheritance regions in the partial-area taxonomy and overlapping partial-areas in the disjoint partial-area taxonomy. Multiple hypotheses are formulated to characterize the error distributions and ratios with respect to different concept groups presented by the taxonomies, and thus further establish their efficacy as vehicles for auditing. The methodologies are demonstrated using SNOMED’s Specimen hierarchy as the test bed. Auditing results are reported and analyzed to assess the hypotheses. With the use of the double bootstrap and Fisher’s exact test (two-tailed), the aforementioned hypotheses are confirmed. Auditing on various complex concept groups based on the taxonomies is shown to yield a statistically significant higher proportion of errors

    Using structural and semantic methodologies to enhance biomedical terminologies

    Get PDF
    Biomedical terminologies and ontologies underlie various Health Information Systems (HISs), Electronic Health Record (EHR) Systems, Health Information Exchanges (HIEs) and health administrative systems. Moreover, the proliferation of interdisciplinary research efforts in the biomedical field is fueling the need to overcome terminological barriers when integrating knowledge from different fields into a unified research project. Therefore well-developed and well-maintained terminologies are in high demand. Most of the biomedical terminologies are large and complex, which makes it impossible for human experts to manually detect and correct all errors and inconsistencies. Automated and semi-automated Quality Assurance methodologies that focus on areas that are more likely to contain errors and inconsistencies are therefore important. In this dissertation, structural and semantic methodologies are used to enhance biomedical terminologies. The dissertation work is divided into three major parts. The first part consists of structural auditing techniques for the Semantic Network of the Unified Medical Language System (UMLS), which serves as a vocabulary knowledge base for biomedical research in various applications. Research techniques are presented on how to automatically identify and prevent erroneous semantic type assignments to concepts. The Web-based adviseEditor system is introduced to help UMLS editors to make correct multiple semantic type assignments to concepts. It is made available to the National Library of Medicine for future use in maintaining the UMLS. The second part of this dissertation is on how to enhance the conceptual content of SNOMED CT by methods of semantic harmonization. By 2015, SNOMED will become the standard terminology for EH R encoding of diagnoses and problem lists. In order to enrich the semantics and coverage of SNOMED CT for clinical and research applications, the problem of semantic harmonization between SNOMED CT and six reference terminologies is approached by 1) comparing the vertical density of SNOM ED CT with the reference terminologies to find potential concepts for export and import; and 2) categorizing the relationships between structurally congruent concepts from pairs of terminologies, with SNOMED CT being one terminology in the pair. Six kinds of configurations are observed, e.g., alternative classifications, and suggested synonyms. For each configuration, a corresponding solution is presented for enhancing one or both of the terminologies. The third part applies Quality Assurance techniques based on “Abstraction Networks” to biomedical ontologies in BioPortal. The National Center for Biomedical Ontology provides B ioPortal as a repository of over 350 biomedical ontologies covering a wide range of domains. It is extremely difficult to design a new Quality Assurance methodology for each ontology in BioPortal. Fortunately, groups of ontologies in BioPortal share common structural features. Thus, they can be grouped into families based on combinations of these features. A uniform Quality Assurance methodology design for each family will achieve improved efficiency, which is critical with the limited Quality Assurance resources available to most ontology curators. In this dissertation, a family-based framework covering 186 BioPortal ontologies and accompanying Quality Assurance methods based on abstraction networks are presented to tackle this problem

    FEDRR: Fast, Exhaustive Detection of Redundant Hierarchical Relations for Quality Improvement of Large Biomedical Ontologies

    Get PDF
    Background: Redundant hierarchical relations refer to such patterns as two paths from one concept to another, one with length one (direct) and the other with length greater than one (indirect). Each redundant relation represents a possibly unintended defect that needs to be corrected in the ontology quality assurance process. Detecting and eliminating redundant relations would help improve the results of all methods relying on the relevant ontological systems as knowledge source, such as the computation of semantic distance between concepts and for ontology matching and alignment. Results: This paper introduces a novel and scalable approach, called FEDRR – Fast, Exhaustive Detection of Redundant Relations – for quality assurance work during ontological evolution. FEDRR combines the algorithm ideas of Dynamic Programming with Topological Sort, for exhaustive mining of all redundant hierarchical relations in ontological hierarchies, in O(c·|V|+|E|) time, where |V| is the number of concepts, |E| is the number of the relations, and c is a constant in practice. Using FEDRR, we performed exhaustive search of all redundant is-a relations in two of the largest ontological systems in biomedicine: SNOMED CT and Gene Ontology (GO). 372 and 1609 redundant is-a relations were found in the 2015-09-01 version of SNOMED CT and 2015-05-01 version of GO, respectively. We have also performed FEDRR on over 190 source vocabularies in the UMLS - a large integrated repository of biomedical ontologies, and identified six sources containing redundant is-a relations. Randomly generated ontologies have also been used to further validate the efficiency of FEDRR. Conclusions: FEDRR provides a generally applicable, effective tool for systematic detecting redundant relations in large ontological systems for quality improvement

    Relationship auditing of the FMA ontology

    Get PDF
    The Foundational Model of Anatomy (FMA) ontology is a domain reference ontology based on a disciplined modeling approach. Due to its large size, semantic complexity and manual data entry process, errors and inconsistencies are unavoidable and might remain within the FMA structure without detection. In this paper, we present computable methods to highlight candidate concepts for various relation- ship assignment errors. The process starts with locating structures formed by transitive structural relationships (part_of, tributary_of, branch_of) and examine their assignments in the context of the IS-A hierarchy. The algorithms were designed to detect five major categories of possible incorrect relationship assignments: circular, mutually exclusive, redundant, inconsistent, and missed entries. A domain expert reviewed samples of these presumptive errors to confirm the findings. Seven thousand and fifty-two presumptive errors were detected, the largest proportion related to part_of relationship assignments. The results highlight the fact that errors are unavoidable in complex ontologies and that well designed algorithms can help domain experts to focus on concepts with high likelihood of errors and maximize their effort to ensure consistency and reliability. In the future similar methods might be integrated with data entry processes to offer real-time error detection

    Designing novel abstraction networks for ontology summarization and quality assurance

    Get PDF
    Biomedical ontologies are complex knowledge representation systems. Biomedical ontologies support interdisciplinary research, interoperability of medical systems, and Electronic Healthcare Record (EHR) encoding. Ontologies represent knowledge using concepts (entities) linked by relationships. Ontologies may contain hundreds of thousands of concepts and millions of relationships. For users, the size and complexity of ontologies make it difficult to comprehend “the big picture” of an ontology\u27s content. For ontology editors, size and complexity make it difficult to uncover errors and inconsistencies. Errors in an ontology will ultimately affect applications that utilize the ontology. In prior studies abstraction networks (AbNs) were developed to provide a compact summary of an ontology\u27s content and structure. AbNs have been shown to successfully support ontology summarization and quality assurance (QA), e.g., for SNOMED CT and NCIt. Despite the success of these previous studies, several major, unaddressed issues affect the applicability and usability of AbNs. This thesis is broken into five major parts, each addressing one issue. The first part of this dissertation addresses the scalability of AbN-based QA techniques to large SNOMED CT hierarchies. Previous studies focused on relatively small hierarchies. The QA techniques developed for these small hierarchies do not scale to large hierarchies, e.g., Procedure and Clinical finding. A new type of AbN, called a subtaxonomy, is introduced to address this problem. Subtaxonomies summarize a subset of an ontology\u27s content. Several types of subtaxonomies and subtaxonomy-based QA studies are discussed. The second part of this dissertation addresses the need for summarization and QA methods for the twelve SNOMED CT hierarchies with no lateral relationships. Previously developed SNOMED CT AbN derivation methodologies, which require lateral relationships, cannot be applied to these hierarchies. The Tribal Abstraction Network (TAN) is a new type of AbN derived using only hierarchical relationships. A TAN-based QA methodology is introduced and the results of a QA review of the Observable entity hierarchy are reported. The third part focuses on the development of generic AbN derivation methods that are applicable to groups of structurally similar ontologies, e.g., those developed in the Web Ontology Language (OWL) format. Previously, AbN derivation techniques were applicable to only a single ontology at a time. AbNs that are applicable to many OWL ontologies are introduced, a preliminary study on OWL AbN granularity is reported on, and the results of several QA studies are presented. The fourth part describes Diff Abstraction Networks, which summarize and visualize the structural differences between two ontology releases. Diff Area Taxonomy and Diff Partial-area Taxonomy derivation methodologies are introduced and Diff Partial-area taxonomies are derived for three OWL ontologies. The Diff Abstraction Network approach is compared to the traditional ontology diff approach. Lastly, tools for deriving and visualizing AbNs are described. The Biomedical Layout Utility Framework is introduced to support the automatic creation, visualization, and exploration of abstraction networks for SNOMED CT and OWL ontologies
    • …
    corecore