28 research outputs found

    Structural auditing methodologies for controlled terminologies

    Get PDF
    Several auditing methodologies for large controlled terminologies are developed. These are applied to the Unified Medical Language System XXXX and the National Cancer Institute Thesaurus (NCIT). Structural auditing methodologies are based on the structural aspects such as IS-A hierarchy relationships groups of concepts assigned to semantic types and groups of relationships defined for concepts. Structurally uniform groups of concepts tend to be semantically uniform. Structural auditing methodologies focus on concepts with unlikely or rare configuration. These concepts have a high likelihood for errors. One of the methodologies is based on comparing hierarchical relationships between the META and SN, two major knowledge sources of the UMLS. In general, a correspondence between them is expected since the SN hierarchical relationships should abstract the META hierarchical relationships. It may indicate an error when a mismatch occurs. The UMLS SN has 135 categories called semantic types. However, in spite of its medium size, the SN has limited use for comprehension purposes because it cannot be easily represented in a pictorial form, it has many (about 7,000) relationships. Therefore, a higher-level abstraction for the SN called a metaschema, is constructed. Its nodes are meta-semantic types, each representing a connected group of semantic types of the SN. One of the auditing methodologies is based on a kind of metaschema called a cohesive metaschema. The focus is placed on concepts of intersections of meta-semantic types. As is shown, such concepts have high likelihood for errors. Another auditing methodology is based on dividing the NCIT into areas according to the roles of its concepts. Moreover, each multi-rooted area is further divided into pareas that are singly rooted. Each p-area contains a group of structurally and semantically uniform concepts. These groups, as well as two derived abstraction networks called taxonomies, help in focusing on concepts with potential errors. With genomic research being at the forefront of bioscience, this auditing methodology is applied to the Gene hierarchy as well as the Biological Process hierarchy of the NCIT, since processes are very important for gene information. The results support the hypothesis that the occurrence of errors is related to the size of p-areas. Errors are more frequent for small p-areas

    Enriching and designing metaschemas for the UMLS semantic network

    Get PDF
    The disparate terminologies used by various biomedical applications or professionals make the communication between them more difficult. The Unified Medical Language System (UMLS) of the National Library of Medicine (NLM) is an attempt to integrate different medical terminologies into a unified representation framework to improve decision making and the quality of patient care as well as research in the health-care field. Metathesaurus (META) and Semantic Network (SN) are two main components of the UMLS system, where the SN provides a high-level abstract of the concepts in the META. This dissertation addresses three problems of the SN. First, the SN\u27s two-tree structure is restrictive because it does not allow a semantic type to be a specialization of several other semantic types. This restriction leads to the omission of some subsumption knowledge in the SN. Secondly, the SN is large and complex for comprehension purposes and it does not come with a pictorial representation for users. As a partial solution for this problem, several metaschemas were previously built as higher-level abstractions for the SN to help users\u27 orientation. Third, there is no efficient method to evaluate each metaschema. There is no technique to obtain a consolidated metaschema acceptable for a majority of the UMLS\u27s users. In this dissertation work the author attacked the described problems by using the following approaches. (1) The SN was expanded into the Enriched Semantic Network (ESN), a multiple subsumption structure with a directed acyclic graph (DAG) IS-A hierarchy, allowing a semantic type to have multiple parents. New viable IS-A links were added as warranted. Two methodologies were presented to identify and add new viable IS-A links. The ESN serves as an extended high-level abstract of the META. (2) The ESN\u27s semantic relationship distribution and concept configuration were studied. Rules were defined to derive the ESN\u27s semantic relationship distribution from the current SN\u27s semantic relationship distribution. A mapping function was defined to map the SN\u27s concept configuration to the ESN\u27s concept configuration, avoiding redundant classifications in the ESN\u27s concept configuration. (3) Several new metaschemas for the SN and the ESN were built and evaluated based on several different partitioning techniques. Each of these metaschema can serve as a higher-level abstraction of the SN (or the ESN)

    Semantic reclassification of the UMLS concepts

    Get PDF
    Summary: Accurate semantic classification is valuable for text mining and knowledge-based tasks that perform inference based on semantic classes. To benefit applications using the semantic classification of the Unified Medical Language System (UMLS) concepts, we automatically reclassified the concepts based on their lexical and contextual features. The new classification is useful for auditing the original UMLS semantic classification and for building biomedical text mining applications

    A chemical specialty semantic network for the Unified Medical Language System

    Get PDF
    Background Terms representing chemical concepts found the Unified Medical Language System (UMLS) are used to derive an expanded semantic network with mutually exclusive semantic types. The UMLS Semantic Network (SN) is composed of a collection of broad categories called semantic types (STs) that are assigned to concepts. Within the UMLS’s coverage of the chemical domain, we find a great deal of concepts being assigned more than one ST. This leads to the situation where the extent of a given ST may contain concepts elaborating variegated semantics. A methodology for expanding the chemical subhierarchy of the SN into a finer-grained categorization of mutually exclusive types with semantically uniform extents is presented. We call this network a Chemical Specialty Semantic Network (CSSN). A CSSN is derived automatically from the existing chemical STs and their assignments. The methodology incorporates a threshold value governing the minimum size of a type’s extent needed for inclusion in the CSSN. Thus, different CSSNs can be created by choosing different threshold values based on varying requirements. Results A complete CSSN is derived using a threshold value of 300 and having 68 STs. It is used effectively to provide high-level categorizations for a random sample of compounds from the “Chemical Entities of Biological Interest” (ChEBI) ontology. The effect on the size of the CSSN using various threshold parameter values between one and 500 is shown. Conclusions The methodology has several potential applications, including its use to derive a pre-coordinated guide for ST assignments to new UMLS chemical concepts, as a tool for auditing existing concepts, inter-terminology mapping, and to serve as an upper-level network for ChEBI

    Relationship auditing of the FMA ontology

    Get PDF
    The Foundational Model of Anatomy (FMA) ontology is a domain reference ontology based on a disciplined modeling approach. Due to its large size, semantic complexity and manual data entry process, errors and inconsistencies are unavoidable and might remain within the FMA structure without detection. In this paper, we present computable methods to highlight candidate concepts for various relation- ship assignment errors. The process starts with locating structures formed by transitive structural relationships (part_of, tributary_of, branch_of) and examine their assignments in the context of the IS-A hierarchy. The algorithms were designed to detect five major categories of possible incorrect relationship assignments: circular, mutually exclusive, redundant, inconsistent, and missed entries. A domain expert reviewed samples of these presumptive errors to confirm the findings. Seven thousand and fifty-two presumptive errors were detected, the largest proportion related to part_of relationship assignments. The results highlight the fact that errors are unavoidable in complex ontologies and that well designed algorithms can help domain experts to focus on concepts with high likelihood of errors and maximize their effort to ensure consistency and reliability. In the future similar methods might be integrated with data entry processes to offer real-time error detection

    Using structural and semantic methodologies to enhance biomedical terminologies

    Get PDF
    Biomedical terminologies and ontologies underlie various Health Information Systems (HISs), Electronic Health Record (EHR) Systems, Health Information Exchanges (HIEs) and health administrative systems. Moreover, the proliferation of interdisciplinary research efforts in the biomedical field is fueling the need to overcome terminological barriers when integrating knowledge from different fields into a unified research project. Therefore well-developed and well-maintained terminologies are in high demand. Most of the biomedical terminologies are large and complex, which makes it impossible for human experts to manually detect and correct all errors and inconsistencies. Automated and semi-automated Quality Assurance methodologies that focus on areas that are more likely to contain errors and inconsistencies are therefore important. In this dissertation, structural and semantic methodologies are used to enhance biomedical terminologies. The dissertation work is divided into three major parts. The first part consists of structural auditing techniques for the Semantic Network of the Unified Medical Language System (UMLS), which serves as a vocabulary knowledge base for biomedical research in various applications. Research techniques are presented on how to automatically identify and prevent erroneous semantic type assignments to concepts. The Web-based adviseEditor system is introduced to help UMLS editors to make correct multiple semantic type assignments to concepts. It is made available to the National Library of Medicine for future use in maintaining the UMLS. The second part of this dissertation is on how to enhance the conceptual content of SNOMED CT by methods of semantic harmonization. By 2015, SNOMED will become the standard terminology for EH R encoding of diagnoses and problem lists. In order to enrich the semantics and coverage of SNOMED CT for clinical and research applications, the problem of semantic harmonization between SNOMED CT and six reference terminologies is approached by 1) comparing the vertical density of SNOM ED CT with the reference terminologies to find potential concepts for export and import; and 2) categorizing the relationships between structurally congruent concepts from pairs of terminologies, with SNOMED CT being one terminology in the pair. Six kinds of configurations are observed, e.g., alternative classifications, and suggested synonyms. For each configuration, a corresponding solution is presented for enhancing one or both of the terminologies. The third part applies Quality Assurance techniques based on “Abstraction Networks” to biomedical ontologies in BioPortal. The National Center for Biomedical Ontology provides B ioPortal as a repository of over 350 biomedical ontologies covering a wide range of domains. It is extremely difficult to design a new Quality Assurance methodology for each ontology in BioPortal. Fortunately, groups of ontologies in BioPortal share common structural features. Thus, they can be grouped into families based on combinations of these features. A uniform Quality Assurance methodology design for each family will achieve improved efficiency, which is critical with the limited Quality Assurance resources available to most ontology curators. In this dissertation, a family-based framework covering 186 BioPortal ontologies and accompanying Quality Assurance methods based on abstraction networks are presented to tackle this problem

    Structural analysis and auditing of SNOMED hierarchies using abstraction networks

    Get PDF
    SNOMED is one of the leading healthcare terminologies being used worldwide. Due to its sheer volume and continuing expansion, it is inevitable that errors will make their way into SNOMED. Thus, quality assurance is an important part of its maintenance cycle. A structural approach is presented in this dissertation, aiming at developing automated techniques that can aid auditors in the discovery of terminology errors more effectively and efficiently. Large SNOMED hierarchies are partitioned, based primarily on their relationships patterns, into concept groups of more manageable sizes. Three related abstraction networks with respect to a SNOMED hierarchy, namely the area taxonomy, partial-area taxonomy, and disjoint partial-area taxonomy, are derived programmatically from the partitions. Altogether they afford high-level abstraction views of the underlying hierarchy, each with different granularity. The area taxonomy gives a global structural view of a SNOMED hierarchy, while the partial-area taxonomy focuses more on the semantic uniformity and hierarchical proximity of concepts. The disjoint partial-area taxonomy is devised as an enhancement of the partial-area taxonomy and is based on the partition of the entire collection of so-called overlapping concepts into singly-rooted groups. The taxonomies are exploited as the basis for a number of systematic auditing regimens, with a theme that complex concepts are more error-prone and require special attention in auditing activities. In general, group-based auditing is promoted to achieve a more efficient review within semantically uniform groups. Certain concept groups in the different taxonomies are deemed “complex” according to various criteria and thus deserve focused auditing. Examples of these include strict inheritance regions in the partial-area taxonomy and overlapping partial-areas in the disjoint partial-area taxonomy. Multiple hypotheses are formulated to characterize the error distributions and ratios with respect to different concept groups presented by the taxonomies, and thus further establish their efficacy as vehicles for auditing. The methodologies are demonstrated using SNOMED’s Specimen hierarchy as the test bed. Auditing results are reported and analyzed to assess the hypotheses. With the use of the double bootstrap and Fisher’s exact test (two-tailed), the aforementioned hypotheses are confirmed. Auditing on various complex concept groups based on the taxonomies is shown to yield a statistically significant higher proportion of errors

    Extensions of SNOMED taxonomy abstraction networks supporting auditing and complexity analysis

    Get PDF
    The Systematized Nomenclature of Medicine – Clinical Terms (SNOMED CT) has been widely used as a standard terminology in various biomedical domains. The enhancement of the quality of SNOMED contributes to the improvement of the medical systems that it supports. In previous work, the Structural Analysis of Biomedical Ontologies Center (SABOC) team has defined the partial-area taxonomy, a hierarchical abstraction network consisting of units called partial-areas. Each partial-area comprises a set of SNOMED concepts exhibiting a particular relationship structure and being distinguished by a unique root concept. In this dissertation, some extensions and applications of the taxonomy framework are considered. Some concepts appearing in multiple partial-areas have been designated as complex due to the fact that they constitute a tangled portion of a hierarchy and can be obstacles to users trying to gain an understanding of the hierarchy’s content. A methodology for partitioning the entire collection of these so-called overlapping complex concepts into singly-rooted groups was presented. A novel auditing methodology based on an enhanced abstraction network is described. In addition, the existing abstraction network relies heavily on the structure of the outgoing relationships of the concepts. But some of SNOMED hierarchies (or subhierarchies) serve only as targets of relationships, with few or no outgoing relationships of their own. This situation impedes the applicability of the abstraction network. To deal with this problem, a variation of the above abstraction network, called the converse abstraction network (CAN) is defined and derived automatically from a given SNOMED hierarchy. An auditing methodology based on the CAN is formulated. Furthermore, a preliminary study of the complementary use of the abstraction network in description logic (DL) for quality assurance purposes pertaining to SNOMED is presented. Two complexity measures, a structural complexity measure and a hierarchical complexity measure, based on the abstraction network are introduced to quantify the complexity of a SNOMED hierarchy. An extension of the two measures is also utilized specifically to track the complexity of the versions of the SNOMED hierarchies before and after a sequence of auditing processes

    Designing novel abstraction networks for ontology summarization and quality assurance

    Get PDF
    Biomedical ontologies are complex knowledge representation systems. Biomedical ontologies support interdisciplinary research, interoperability of medical systems, and Electronic Healthcare Record (EHR) encoding. Ontologies represent knowledge using concepts (entities) linked by relationships. Ontologies may contain hundreds of thousands of concepts and millions of relationships. For users, the size and complexity of ontologies make it difficult to comprehend “the big picture” of an ontology\u27s content. For ontology editors, size and complexity make it difficult to uncover errors and inconsistencies. Errors in an ontology will ultimately affect applications that utilize the ontology. In prior studies abstraction networks (AbNs) were developed to provide a compact summary of an ontology\u27s content and structure. AbNs have been shown to successfully support ontology summarization and quality assurance (QA), e.g., for SNOMED CT and NCIt. Despite the success of these previous studies, several major, unaddressed issues affect the applicability and usability of AbNs. This thesis is broken into five major parts, each addressing one issue. The first part of this dissertation addresses the scalability of AbN-based QA techniques to large SNOMED CT hierarchies. Previous studies focused on relatively small hierarchies. The QA techniques developed for these small hierarchies do not scale to large hierarchies, e.g., Procedure and Clinical finding. A new type of AbN, called a subtaxonomy, is introduced to address this problem. Subtaxonomies summarize a subset of an ontology\u27s content. Several types of subtaxonomies and subtaxonomy-based QA studies are discussed. The second part of this dissertation addresses the need for summarization and QA methods for the twelve SNOMED CT hierarchies with no lateral relationships. Previously developed SNOMED CT AbN derivation methodologies, which require lateral relationships, cannot be applied to these hierarchies. The Tribal Abstraction Network (TAN) is a new type of AbN derived using only hierarchical relationships. A TAN-based QA methodology is introduced and the results of a QA review of the Observable entity hierarchy are reported. The third part focuses on the development of generic AbN derivation methods that are applicable to groups of structurally similar ontologies, e.g., those developed in the Web Ontology Language (OWL) format. Previously, AbN derivation techniques were applicable to only a single ontology at a time. AbNs that are applicable to many OWL ontologies are introduced, a preliminary study on OWL AbN granularity is reported on, and the results of several QA studies are presented. The fourth part describes Diff Abstraction Networks, which summarize and visualize the structural differences between two ontology releases. Diff Area Taxonomy and Diff Partial-area Taxonomy derivation methodologies are introduced and Diff Partial-area taxonomies are derived for three OWL ontologies. The Diff Abstraction Network approach is compared to the traditional ontology diff approach. Lastly, tools for deriving and visualizing AbNs are described. The Biomedical Layout Utility Framework is introduced to support the automatic creation, visualization, and exploration of abstraction networks for SNOMED CT and OWL ontologies
    corecore