20 research outputs found

    STRUCTURAL AND LEXICAL METHODS FOR AUDITING BIOMEDICAL TERMINOLOGIES

    Biomedical terminologies serve as knowledge sources for a wide variety of biomedical applications including information extraction and retrieval, data integration and management, and decision support. Quality issues in biomedical terminologies, if not addressed, could affect all downstream applications that use them as knowledge sources. Therefore, Terminology Quality Assurance (TQA) has become an integral part of the terminology management lifecycle. However, identifying potential quality issues is challenging due to the ever-growing size and complexity of biomedical terminologies. Auditing them manually is time-consuming and labor-intensive, so automated TQA methods are highly desirable. In this dissertation, systematic and scalable methods to audit biomedical terminologies utilizing their structural as well as lexical information are proposed. Two inference-based methods, two non-lattice-based methods, and a deep learning-based method are developed to identify potentially missing hierarchical (or is-a) relations, erroneous is-a relations, and missing concepts in biomedical terminologies including the Gene Ontology (GO), the National Cancer Institute thesaurus (NCIt), and SNOMED CT. In the first inference-based method, the GO concept names are represented using a set-of-words model and a sequence-of-words model, respectively. Inconsistencies derived between hierarchically linked and unlinked concept pairs are leveraged to detect potentially missing or erroneous is-a relations. The set-of-words model detects a total of 5,359 potential inconsistencies in the 03/28/2017 release of GO and the sequence-of-words model detects 4,959. Domain experts’ evaluation shows that the set-of-words model achieves a precision of 53.78% (128 out of 238) and the sequence-of-words model achieves a precision of 57.55% (122 out of 212) in identifying inconsistencies. 
In the second inference-based method, a Subsumption-based Sub-term Inference Framework (SSIF) is developed by introducing a novel term-algebra on top of a sequence-based representation of GO concepts. The sequence-based representation utilizes the part of speech of concept names, sub-concepts (concept names appearing inside another concept name), and antonyms appearing in concept names. Three conditional rules (monotonicity, intersection, and sub-concept rules) are developed for backward subsumption inference. Applying SSIF to the 10/03/2018 release of GO suggests 1,938 potentially missing is-a relations. Domain experts’ evaluation of 210 randomly selected potentially missing is-a relations shows that SSIF achieves a precision of 60.61%, 60.49%, and 46.03% for the monotonicity, intersection, and sub-concept rules, respectively. In the first non-lattice-based method, lexical patterns of concepts in Non-Lattice Subgraphs (NLSs: graph fragments with a higher tendency to contain quality issues) are mined to detect potentially missing is-a relations and missing concepts in NCIt. Six lexical patterns are leveraged: containment, union, intersection, union-intersection, inference-contradiction, and inference-union. Each pattern indicates a potential specific type of error and suggests a potential type of remediation. This method identifies 809 NLSs exhibiting these patterns in the 16.12d version of NCIt, achieving a precision of 66% (33 out of 50). In the second non-lattice-based method, enriched lexical attributes from concept ancestors are leveraged to identify potentially missing is-a relations in NLSs. The lexical attributes of a concept are inherited in two ways: from ancestors within the NLS, and from all the ancestors. For a pair of concepts without a hierarchical relation, if the lexical attributes of one concept are a subset of those of the other, a potentially missing is-a relation between the two concepts is suggested. 
This method identifies a total of 1,022 potentially missing is-a relations in the 19.01d release of NCIt with a precision of 84.44% (76 out of 90) for inheriting lexical attributes from ancestors within the NLS and 89.02% (73 out of 82) for inheriting from all the ancestors. For the non-lattice-based methods, similar NLSs may contain similar quality issues, and thus exhaustive examination of NLSs would involve redundant work. A hybrid method is introduced to identify similar NLSs to avoid redundant analyses. Given an input NLS, a graph isomorphism algorithm is used to obtain its structurally identical NLSs. A similarity score between the input NLS and each of its structurally identical NLSs is computed based on semantic similarity between their corresponding concept names. To compute the similarity between concept names, the concept names are converted to vectors using the Doc2Vec document embedding model and then the cosine similarity of the two vectors is computed. All the structurally identical NLSs with a similarity score above 0.85 are considered to be similar to the input NLS. Applying this method to 10 different structures of NLSs in the 02/12/2018 release of GO reveals that 38.43% of these NLSs have at least one similar NLS. Finally, a deep learning-based method is explored to facilitate the suggestion of missing is-a relations in NCIt and SNOMED CT. The focus here is on concept pairs exhibiting a containment pattern. The problem is framed as a binary classification task, where given a pair of concepts, the deep learning model learns to predict whether the two concepts have an is-a relation or not. Positive training samples are existing is-a relations in the terminology that exhibit the containment pattern. Negative training samples are concept pairs without is-a relations that also exhibit the containment pattern. A graph neural network model is constructed for this task and trained with subgraphs generated to enclose the pairs of concepts in the samples. 
To evaluate the model trained on each of the two terminologies, two evaluation sets are created, using newer releases of each terminology as a partial reference standard. The model trained on NCIt achieves a precision of 0.5, a recall of 0.75, and an F1 score of 0.6. The model trained on SNOMED CT achieves a precision of 0.51, a recall of 0.64, and an F1 score of 0.56.
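
The subset rule from the second non-lattice-based method in the abstract above can be illustrated with a minimal sketch. Several things here are assumptions rather than the dissertation's implementation: lexical attributes are approximated as lowercased words of the concept name, attributes are inherited from all ancestors, the restriction to non-lattice subgraphs is ignored, and the direction of the suggested relation (the concept with more attributes is-a the concept with fewer) is a guess that the abstract does not state. The concept identifiers and names are hypothetical.

```python
from collections import deque

def ancestors(concept, parents):
    """Return all ancestors of `concept` by following is-a links upward."""
    seen = set()
    queue = deque(parents.get(concept, ()))
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(parents.get(node, ()))
    return seen

def lexical_attributes(concept, names, parents):
    """Words of the concept name plus words inherited from all ancestors (assumed model)."""
    words = set(names[concept].lower().split())
    for anc in ancestors(concept, parents):
        words |= set(names[anc].lower().split())
    return words

def suggest_missing_isa(names, parents):
    """Suggest (child, parent) pairs that are hierarchically unrelated but whose
    attribute sets are in a proper-subset relation."""
    attrs = {c: lexical_attributes(c, names, parents) for c in names}
    anc = {c: ancestors(c, parents) for c in names}
    suggestions = []
    for a in names:
        for b in names:
            if a == b or b in anc[a] or a in anc[b]:
                continue  # same concept, or already hierarchically related
            if attrs[b] < attrs[a]:  # b's attributes are a proper subset of a's
                suggestions.append((a, b))
    return sorted(suggestions)

# Hypothetical toy terminology: the link "C4 is-a C2" is deliberately left out.
names = {
    "C1": "neoplasm",
    "C2": "lung neoplasm",
    "C3": "malignant neoplasm",
    "C4": "malignant lung neoplasm",
}
parents = {"C2": ["C1"], "C3": ["C1"], "C4": ["C3"]}

print(suggest_missing_isa(names, parents))
```

On this toy input the only suggestion is that C4 may be missing an is-a link to C2. A real audit would still need expert review, as the reported precision figures indicate.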

    Planning a multilingual database for higher-education terminology (Planiranje višejezične baze podataka za nazivlje iz područja visokoga obrazovanja)

    The paper first studies European and Hungarian organisations and institutions that are related to the terminology of education. It then analyses glossaries, dictionaries and databases that can be found online on the webpages of UNESCO and the European Union, as well as those that contain education terminology in Hungarian (online and offline). Finally, it introduces our planned database. The terminology of education is a key area both at the national level and in the context of the European Union. Word lists, glossaries and dictionaries already exist in certain languages that contain the terminology of education in one or more languages. Our aim is to design and prepare a multilingual terminology database in the field of education terminology. The languages we plan to work with are Hungarian, English, and the official languages (Romanian, Slovak, Ukrainian, Croatian, German, Serbian, Slovenian) of the territories in the neighbouring countries where there is a substantial Hungarian minority, whose members attend school either in the official language of that country or in Hungarian.

    Mining Non-Lattice Subgraphs for Detecting Missing Hierarchical Relations and Concepts in SNOMED CT

    Objective: Quality assurance of large ontological systems such as SNOMED CT is an indispensable part of the terminology management lifecycle. We introduce a hybrid structural-lexical method for scalable and systematic discovery of missing hierarchical relations and concepts in SNOMED CT. Materials and Methods: All non-lattice subgraphs (the structural part) in SNOMED CT are exhaustively extracted using a scalable MapReduce algorithm. Four lexical patterns (the lexical part) are identified among the extracted non-lattice subgraphs. Non-lattice subgraphs exhibiting such lexical patterns are often indicative of missing hierarchical relations or concepts. Each lexical pattern is associated with a potential specific type of error. Results: Applying the structural-lexical method to SNOMED CT (September 2015 US edition), we found 6801 non-lattice subgraphs that matched these lexical patterns, of which 2046 were amenable to visual inspection. We evaluated a random sample of 100 small subgraphs, of which 59 were reviewed in detail by domain experts. All the subgraphs reviewed contained errors confirmed by the experts. The most frequent type of error was missing is-a relations due to incomplete or inconsistent modeling of the concepts. Conclusions: Our hybrid structural-lexical method is innovative and proved effective not only in detecting errors in SNOMED CT, but also in suggesting remediation for these errors.
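
The structural step can be pictured with a small in-memory sketch. It assumes the usual order-theoretic reading, in which a pair of concepts is "non-lattice" when it has more than one maximal common descendant; the abstract itself does not spell out the definition. The naive pairwise loop below is only illustrative and is not the scalable MapReduce algorithm the authors describe. The concept labels are hypothetical.

```python
from itertools import combinations

def descendants(concept, children):
    """Return all descendants of `concept` by following is-a links downward."""
    seen, stack = set(), list(children.get(concept, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(children.get(node, ()))
    return seen

def maximal_common_descendants(a, b, children):
    """Common descendants of a and b that lie below no other common descendant."""
    common = descendants(a, children) & descendants(b, children)
    return {
        c for c in common
        if not any(c in descendants(d, children) for d in common if d != c)
    }

def non_lattice_pairs(concepts, children):
    """Pairs with more than one maximal common descendant (assumed definition)."""
    return [
        (a, b)
        for a, b in combinations(sorted(concepts), 2)
        if len(maximal_common_descendants(a, b, children)) > 1
    ]

# Hypothetical hierarchy: A and B both subsume C and D, which both subsume E.
children = {"A": ["C", "D"], "B": ["C", "D"], "C": ["E"], "D": ["E"]}
print(non_lattice_pairs({"A", "B", "C", "D", "E"}, children))
```

Here A and B share two maximal common descendants (C and D), so they form the only non-lattice pair. The subgraph spanned by such a pair and its maximal common descendants would then be the candidate for lexical pattern matching.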

    Progress Notes

    https://scholarlyworks.lvhn.org/progress_notes/1270/thumbnail.jp

    Development of an Artificial Intelligence Method for the Analysis of Bloodstain Patterns

    Bloodstain Pattern Analysis (BPA) is a forensic discipline that plays a crucial role in reconstructing the events at a crime scene (Acampora, 2014). The shape, size, distribution, and location of bloodstains can help infer the potential murder weapon, the origin of the attack, and whether the body has been moved or relocated from the original crime scene. Errors in identifying blood spatter evidence commonly arise when the crime scene has large amounts of bloodstains, which can yield less information during analysis. This study aims to utilize artificial intelligence (A.I.) algorithms to assist the analyst in the analysis of bloodstain patterns. To date, BPA relies on a manual analysis process; therefore, it is imperative to have forensic analysts who can produce accurate and reliable results (Hoelz, 2009). However, human error is unavoidable, and analyst error can result in inaccurate conclusions that can jeopardize casework. The President's Council of Advisors on Science and Technology (PCAST) report on Forensic Science in Criminal Courts: Ensuring Scientific Validity of Feature-Comparison Methods brought to light the shortcomings of many forensic disciplines, including BPA. To improve the field of BPA, automated and computer-assisted methods of analysis are needed. In this study, we used A.I. to estimate the angle of impact from simulated crime scene samples. Our A.I.-assisted approach was determined to be accurate for 78.64% of all data analyzed. This study focused on the analysis of photos taken from a single impact angle as the primary input data. Bloodstain patterns were experimentally constructed under controlled conditions, with a single variable altered at a time.

    Structural auditing methodologies for controlled terminologies

    Several auditing methodologies for large controlled terminologies are developed. These are applied to the Unified Medical Language System (UMLS) and the National Cancer Institute Thesaurus (NCIT). Structural auditing methodologies are based on structural aspects such as IS-A hierarchy relationships, groups of concepts assigned to semantic types, and groups of relationships defined for concepts. Structurally uniform groups of concepts tend to be semantically uniform. Structural auditing methodologies focus on concepts with unlikely or rare configurations. These concepts have a high likelihood of errors. One of the methodologies is based on comparing hierarchical relationships between the Metathesaurus (META) and the Semantic Network (SN), two major knowledge sources of the UMLS. In general, a correspondence between them is expected, since the SN hierarchical relationships should abstract the META hierarchical relationships. A mismatch may indicate an error. The UMLS SN has 135 categories called semantic types. However, in spite of its medium size, the SN has limited use for comprehension purposes because it cannot be easily represented in pictorial form: it has many (about 7,000) relationships. Therefore, a higher-level abstraction for the SN, called a metaschema, is constructed. Its nodes are meta-semantic types, each representing a connected group of semantic types of the SN. One of the auditing methodologies is based on a kind of metaschema called a cohesive metaschema. The focus is placed on concepts at intersections of meta-semantic types. As is shown, such concepts have a high likelihood of errors. Another auditing methodology is based on dividing the NCIT into areas according to the roles of its concepts. Moreover, each multi-rooted area is further divided into p-areas that are singly rooted. Each p-area contains a group of structurally and semantically uniform concepts. 
These groups, as well as two derived abstraction networks called taxonomies, help in focusing on concepts with potential errors. With genomic research being at the forefront of bioscience, this auditing methodology is applied to the Gene hierarchy as well as the Biological Process hierarchy of the NCIT, since processes are very important for gene information. The results support the hypothesis that the occurrence of errors is related to the size of p-areas. Errors are more frequent for small p-areas.
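
The area and p-area partition can be sketched as follows. This is a minimal illustration under stated assumptions: an area is taken to be the set of concepts sharing exactly the same set of roles, a root is a concept with no parent inside its own area, and a p-area is a root together with its descendants inside that area. The abstract only says that areas follow roles and that p-areas are singly rooted, so the details, along with the role and concept names, are hypothetical.

```python
from collections import defaultdict

def partition_into_areas(concept_roles):
    """Group concepts by their exact set of roles (assumed definition of an area)."""
    areas = defaultdict(set)
    for concept, roles in concept_roles.items():
        areas[frozenset(roles)].add(concept)
    return dict(areas)

def area_roots(area, parents):
    """Concepts in the area that have no parent inside the same area."""
    return {c for c in area if not any(p in area for p in parents.get(c, ()))}

def partial_areas(area, parents):
    """Map each root to the root plus its descendants within the area."""
    result = {}
    for root in area_roots(area, parents):
        members = {root}
        changed = True
        while changed:  # grow the p-area until no more concepts can be attached
            changed = False
            for c in area - members:
                if any(p in members for p in parents.get(c, ())):
                    members.add(c)
                    changed = True
        result[root] = members
    return result

# Hypothetical concepts, roles, and is-a links.
concept_roles = {"a": {"r1"}, "b": {"r1"}, "c": {"r1"}, "d": {"r1", "r2"}, "e": set()}
parents = {"a": ["e"], "b": ["a"], "c": ["e"], "d": ["b"]}

areas = partition_into_areas(concept_roles)
r1_area = areas[frozenset({"r1"})]
print(partial_areas(r1_area, parents))
```

In this toy input the area for role set {r1} has two roots, a and c, so it splits into the p-areas {a, b} and {c}. Sorting p-areas by size would then put small ones such as {c} first, in line with the finding that errors are more frequent for small p-areas.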

    Extensions of SNOMED taxonomy abstraction networks supporting auditing and complexity analysis

    The Systematized Nomenclature of Medicine – Clinical Terms (SNOMED CT) has been widely used as a standard terminology in various biomedical domains. Enhancing the quality of SNOMED contributes to the improvement of the medical systems that it supports. In previous work, the Structural Analysis of Biomedical Ontologies Center (SABOC) team has defined the partial-area taxonomy, a hierarchical abstraction network consisting of units called partial-areas. Each partial-area comprises a set of SNOMED concepts exhibiting a particular relationship structure and distinguished by a unique root concept. In this dissertation, some extensions and applications of the taxonomy framework are considered. Some concepts appearing in multiple partial-areas have been designated as complex because they constitute a tangled portion of a hierarchy and can be obstacles to users trying to gain an understanding of the hierarchy’s content. A methodology for partitioning the entire collection of these so-called overlapping complex concepts into singly-rooted groups is presented. A novel auditing methodology based on an enhanced abstraction network is described. In addition, the existing abstraction network relies heavily on the structure of the outgoing relationships of the concepts. But some SNOMED hierarchies (or subhierarchies) serve only as targets of relationships, with few or no outgoing relationships of their own. This situation impedes the applicability of the abstraction network. To deal with this problem, a variation of the above abstraction network, called the converse abstraction network (CAN), is defined and derived automatically from a given SNOMED hierarchy. An auditing methodology based on the CAN is formulated. Furthermore, a preliminary study of the complementary use of the abstraction network in description logic (DL) for quality assurance purposes pertaining to SNOMED is presented. 
Two complexity measures, a structural complexity measure and a hierarchical complexity measure, based on the abstraction network are introduced to quantify the complexity of a SNOMED hierarchy. An extension of the two measures is also used specifically to track the complexity of the versions of the SNOMED hierarchies before and after a sequence of auditing processes.

    Quality management in a private speech-language therapy practice

    This study investigated the principles of quality management and their application to a private speech-language therapy practice. The history of quality management and the development of quality management in industry and health care services were reviewed. Quality was defined in terms of the context of the author's private speech-language therapy practice and a working definition of quality was developed. The principles in the development of a quality management programme were described. These principles were used to develop and implement a quality management programme in the author's private speech-language therapy practice. Financial management and client satisfaction were selected as strategic quality factors in the initial stages of the quality management programme. Practice policies were revised to establish success criteria and to measure the practice's conformance to these criteria. The quality management programme enabled the author to improve the quality and effectiveness of her practice's financial management system and to demonstrate the client-centered orientation of the practice by implementing client satisfaction as a quality indicator.