
    Ontology-based Semantic Harmonization of HIV-associated Common Data Elements for Integration of Diverse HIV Research Datasets

    Analysis of integrated, diverse, Human Immunodeficiency Virus (HIV)-associated datasets can increase knowledge and guide the development of novel and effective interventions for disease prevention and treatment by increasing the breadth of variables and statistical power, particularly for sub-group analyses. This topic has been identified as a National Institutes of Health research priority, but few efforts have been made to integrate data across HIV studies. Our aims were to: 1) characterize the semantic heterogeneity (SH) in the HIV research domain; 2) identify HIV-associated common data elements (CDEs) in empirically generated and knowledge-based resources; 3) create a formal representation of HIV-associated CDEs in the form of an HIV-associated Entities in Research Ontology (HERO); and 4) assess the feasibility of using HERO to semantically harmonize HIV research data. Our approach was guided by information/knowledge theory and the DIKW (Data, Information, Knowledge, Wisdom) hierarchical model. Our systematized review of the literature revealed that the purposes of synergistic use of ontologies and CDEs include integration, interoperability, data exchange, and data standardization. Methods and tools included the use of experts for CDE identification, the Unified Medical Language System, natural language processing, Extensible Markup Language, Health Level 7, and ontology development tools (e.g., Protégé). Evaluation methods included expert assessment, quantification of mapping tasks between raters, assessment of interrater reliability, and comparison to established standards. We used these findings to inform our process for achieving the study aims.

    For Aim 1, we analyzed eight disparate HIV-associated data dictionaries and developed a String Metric-assisted Assessment of Semantic Heterogeneity (SMASH) method, which aided the identification of 127 (13%) homogeneous data element (DE) pairs and 1,048 (87%) semantically heterogeneous DE pairs. Most heterogeneous pairs (97%) were semantically equivalent but syntactically different, allowing us to determine that SH in the HIV research domain was high.

    To achieve Aim 2, we used ClinicalTrials.gov, Google Search, and text mining in R to identify HIV-associated CDEs in HIV journal articles, HIV-associated datasets, the AIDSinfo HIV/AIDS Glossary, the AIDSinfo Drug Database, Logical Observation Identifiers Names and Codes (LOINC), the Systematized Nomenclature of Medicine (SNOMED), and RxNorm. Two HIV experts then manually reviewed DEs from the journal articles and data dictionaries to confirm DE commonality and resolved semantic discrepancies through discussion. Ultimately, we identified 2,179 unique CDEs. Of all CDEs, data-driven approaches identified 2,055 (94%) (999 from the HIV/AIDS Glossary, 398 from the Drug Database, 91 from journal articles, and a total of 567 from LOINC, SNOMED, and RxNorm cumulatively). Expert-based approaches identified 124 (6%) unique CDEs from data dictionaries and confirmed the 91 CDEs from journal articles.

    In Aim 3, we used the Protégé suite of ontology development tools and the 2,179 CDEs to develop HERO. We modeled the ontology using the semantic structure of the Medical Entities Dictionary, available hierarchical information from the CDE knowledge resources, and expert knowledge. The ontology fulfilled most relevant criteria from Cimino's desiderata and the OntoClean ontology engineering principles, and it successfully answered eight competency questions.
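    The abstract does not reproduce the SMASH implementation, but the idea of string-metric-assisted comparison of DE pairs can be sketched in a few lines of Python. This is a minimal illustration, assuming a normalized similarity from the standard library and an arbitrary cutoff; the DE names and threshold are hypothetical, not the study's actual parameters.

        from difflib import SequenceMatcher
        from itertools import combinations

        # Hypothetical data element (DE) names drawn from two HIV study data dictionaries.
        data_elements = ["date_of_birth", "birth_date", "cd4_count", "cd4_cell_count", "viral_load"]

        SIMILARITY_THRESHOLD = 0.85  # illustrative cutoff, not the study's actual value

        def similarity(a: str, b: str) -> float:
            """Normalized string similarity in [0, 1] (a stand-in for the study's string metrics)."""
            return SequenceMatcher(None, a.lower(), b.lower()).ratio()

        # Flag each DE pair: near-identical strings suggest homogeneity; the rest need
        # manual review to decide whether they are semantically equivalent but
        # syntactically different (the dominant category found in Aim 1).
        for de1, de2 in combinations(data_elements, 2):
            score = similarity(de1, de2)
            label = "candidate-homogeneous" if score >= SIMILARITY_THRESHOLD else "review-for-heterogeneity"
            print(f"{de1!r} vs {de2!r}: {score:.2f} -> {label}")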
    Finally, for Aim 4, we assessed the feasibility of using HERO to semantically harmonize and integrate the data dictionaries from two diverse HIV-associated datasets. Two HIV experts involved in the development of HERO independently assessed each data dictionary. Of the 367 DEs in data dictionary 1 (D1), 181 (49.32%) were identified as CDEs and 186 (50.68%) were not, and of the 72 DEs in data dictionary 2 (D2), 37 (51.39%) were CDEs and 35 (48.61%) were not. The HIV experts then traversed HERO's hierarchy to map CDEs from D1 and D2 to CDEs in HERO. Of the 181 CDEs in D1, 156 (86.19%) were found in HERO and 25 (13.81%) were not. Similarly, of the 37 CDEs in D2, 32 (86.49%) were found in HERO and 5 (13.51%) were not. Interrater reliability for CDE identification, as measured by Cohen's kappa, was 0.900 for D1 and 0.892 for D2. Cohen's kappas for CDEs in D1 and D2 that were also identified in HERO were 0.885 and 0.688, respectively. Subsequently, to demonstrate the integration of the two HIV-associated datasets, a sample of semantically harmonized CDEs present in both datasets was selected by category (e.g., administrative, demographic, and behavioral), and D2 sample size increases from the integrated datasets were calculated for race (e.g., White, African American/Black, Asian/Pacific Islander, Native American/Indian, and Hispanic/Latino) and for "intravenous drug use". The average sample size increase across the six selected D2 CDEs was 1,928%.

    Despite the limitation that HERO's developers also served as its evaluators, the study's contributions to the fields of informatics and HIV research were substantial. Confirmatory contributions include the identification of effective CDE/ontology tools and the use of data-driven and expert-based methods. Novel contributions include the development of SMASH and HERO, documentation that SH is high in HIV-associated datasets, identification of 2,179 HIV-associated CDEs, creation of two additional classifications of SH, and demonstration that using HERO for semantic harmonization of HIV-associated data dictionaries is feasible. Our future work will build upon this research by expanding the numbers and types of datasets, refining our methods and tools, and conducting an external evaluation.
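    For readers unfamiliar with the reliability statistic used in Aim 4, Cohen's kappa for two raters making CDE/not-CDE judgments can be computed as below. The ratings here are fabricated for illustration; only the formula reflects the study's measure.

        def cohens_kappa(rater1: list[str], rater2: list[str]) -> float:
            """Cohen's kappa for two raters over the same items: (p_o - p_e) / (1 - p_e)."""
            assert len(rater1) == len(rater2)
            n = len(rater1)
            categories = set(rater1) | set(rater2)
            # Observed agreement: fraction of items on which the raters agree.
            p_o = sum(a == b for a, b in zip(rater1, rater2)) / n
            # Expected chance agreement, from each rater's marginal label frequencies.
            p_e = sum((rater1.count(c) / n) * (rater2.count(c) / n) for c in categories)
            return (p_o - p_e) / (1 - p_e)

        # Hypothetical CDE/not-CDE judgments from two HIV experts over ten data elements.
        expert1 = ["CDE", "CDE", "not", "CDE", "not", "not", "CDE", "CDE", "not", "CDE"]
        expert2 = ["CDE", "CDE", "not", "CDE", "not", "CDE", "CDE", "CDE", "not", "CDE"]
        print(f"kappa = {cohens_kappa(expert1, expert2):.3f}")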

    Global Academic Competencies for Health Information Professionals


    Health systems data interoperability and implementation

    Objective: The objective of this study was to use machine learning and health standards to address the problem of clinical data interoperability across healthcare institutions. Addressing this problem has the potential to make clinical data comparable, searchable and exchangeable between healthcare providers.

    Data sources: Structured and unstructured data were used to conduct the experiments in this study. The data were collected from two disparate sources, MIMIC-III and NHANES. The MIMIC-III database stores data from two electronic health record systems, CareVue and MetaVision. The data stored in these systems were not recorded to the same standards and were therefore not comparable: some values conflicted, one system would store an abbreviation of a clinical concept while the other stored the full concept name, and some attributes contained missing information. These issues make this data a good candidate for this study. From the identified sources, laboratory, physical examination, vital signs, and behavioural data were used.

    Methods: This research employed the CRISP-DM framework as a guideline for all stages of data mining. Two sets of classification experiments were conducted, one for structured data and the other for unstructured data. In the first experiment, edit distance, TF-IDF and Jaro-Winkler were used to calculate similarity weights between two datasets, one coded with the LOINC terminology standard and one not coded. Similar sets of data were classified as matches while dissimilar sets were classified as non-matches. Soundex indexing was then used to reduce the number of potential comparisons. Thereafter, three classification algorithms were trained and tested, and the performance of each was evaluated with ROC curves. The second experiment aimed to extract patients' smoking status from a clinical corpus. A sequence-oriented classification algorithm, the conditional random field (CRF), was used to learn related concepts from the corpus, with word embedding, random indexing, and word-shape features used to capture the meaning in the corpus.

    Results: After optimizing the model parameters through v-fold cross-validation on a sampled training set of structured data, only 8 of 24 features were selected for the classification task. RapidMiner was used to train and test all the classification algorithms. In the final run of the classification process, the last contenders were SVM and the decision tree classifier; SVM yielded an accuracy of 92.5% once its parameters were tuned. These results were obtained after more relevant features were identified, it having been observed that the classifiers were biased on the initial data. The unstructured data were annotated via the UIMA Ruta scripting language and trained with CRFSuite, which comes with the CLAMP toolkit. The CRF classifier obtained an F-measure of 94.8% for the "nonsmoker" class, 83.0% for "currentsmoker", and 65.7% for "pastsmoker". It was observed that as more relevant data were added, the performance of the classifier improved. The results point to the need for FHIR resources for exchanging clinical data between healthcare institutions: FHIR is free, and it uses profiles to extend coding standards, a RESTful API to exchange messages, and JSON, XML and Turtle to represent messages.
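    As a rough sketch of the structured-data matching step described in the methods, the fragment below pairs a Soundex blocking key with a normalized similarity score, so that only records sharing a block are compared. The Soundex implementation, sample labels, and threshold are simplified illustrations, not the study's actual RapidMiner pipeline.

        from difflib import SequenceMatcher

        def soundex(name: str) -> str:
            """Classic 4-character Soundex code, used here as a blocking key."""
            codes = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3", "l": "4", "mn": "5", "r": "6"}
            def digit(ch):
                return next((d for letters, d in codes.items() if ch in letters), "")
            name = "".join(ch for ch in name.lower() if ch.isalpha())
            if not name:
                return "0000"
            out, prev = name[0].upper(), digit(name[0])
            for ch in name[1:]:
                d = digit(ch)
                if d and d != prev:
                    out += d
                if ch not in "hw":  # h/w do not reset the previous code
                    prev = d
            return (out + "000")[:4]

        # Hypothetical lab-test labels: LOINC-coded display names vs. local names.
        loinc_names = ["Hemoglobin [Mass/volume] in Blood", "Glucose [Mass/volume] in Serum"]
        local_names = ["hemoglobin blood", "glucose, serum", "heart rate"]

        for loc in local_names:
            for std in loinc_names:
                # Blocking: skip pairs whose first tokens sound different.
                if soundex(loc.split()[0]) != soundex(std.split()[0]):
                    continue
                score = SequenceMatcher(None, loc.lower(), std.lower()).ratio()
                print(f"{loc!r} ~ {std!r}: {score:.2f}", "match" if score > 0.5 else "non-match")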
    Data could be stored in JSON format in a NoSQL database such as CouchDB, which makes it available for further post-extraction exploration.

    Conclusion: This study provided a method for a computer algorithm to learn a clinical coding standard and then apply that learned standard to unstandardized data, so that the data become easily exchangeable, comparable and searchable, ultimately achieving data interoperability. Even though this study was applied on a limited scale, future work will explore the standardization of patients' long-lived data from multiple sources using the SHARPn open-source tools and data-scaling platforms.
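    To make the FHIR point concrete, here is a minimal sketch of what such an exchanged message could look like: a FHIR R4 Observation carrying a LOINC-coded lab result, serialized to JSON as it might be stored in a document database like CouchDB. The specific patient reference and values are illustrative, not drawn from the study's data.

        import json

        # A minimal FHIR R4 Observation resource with a LOINC coding (illustrative values).
        observation = {
            "resourceType": "Observation",
            "status": "final",
            "code": {
                "coding": [{
                    "system": "http://loinc.org",
                    "code": "718-7",  # LOINC: Hemoglobin [Mass/volume] in Blood
                    "display": "Hemoglobin [Mass/volume] in Blood",
                }]
            },
            "subject": {"reference": "Patient/example"},
            "valueQuantity": {"value": 13.5, "unit": "g/dL",
                              "system": "http://unitsofmeasure.org", "code": "g/dL"},
        }

        # Serialize to JSON; the same document could be sent to a FHIR server's
        # RESTful API or stored directly in a JSON document store such as CouchDB.
        print(json.dumps(observation, indent=2))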

    MS

    Delivery of high-quality health care requires access to complete and accurate patient information. Variation in data context and content across disparate clinical systems adversely affects the integration of information needed for effective patient care and outcomes research. This study detects the extent and nature of data variation across three disparate clinical systems used along different points of the perinatal care continuum at Intermountain Health Care (IHC). Three analytical methods were used to examine data variation: data structure analysis; clinician perception of missing data elements; and patient record review of key data values. Knowledge acquisition techniques and consensus among clinical domain experts were used to select sample data elements for the data structure analysis. Findings revealed that only 17% of the sample data elements had compatible structure and meaning across the prenatal, labor and delivery (L&D), and newborn intensive care (NICU) clinical data systems. The impact on clinician efficiency of missing and contradicting information in nonintegrated perinatal systems was captured and analyzed using a Critical Incident Technique-based clinician survey. In a 1-month period, 75% of responding clinicians reported missing data and 34% reported contradicting data. The time taken to resolve problems from 1 month's missing data was estimated at 231 hours for 23 clinicians. Data values from patient records for eight laboratory results were compared across the three perinatal systems. The best match across any two systems was 88% (blood type) and the worst was 0% (antibody screen, chlamydia). The highest incidence of contradicting data was 2.5% for blood type. Comparing the agreement of the three methods ("triangulation") gave additional insight into IHC's data variation problem. The data model study and the patient record review showed missing data element problems beyond what clinicians perceived. In all, the consistency of data capture in the three perinatal systems at IHC is worse than expected. The data necessary to computationally execute the logic of the perinatal care process models is intermittent and unreliable. Rework of the perinatal applications based on a uniform data model and standard terminologies will provide an infrastructure to achieve IHC's vision of interdisciplinary care.
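    The patient record review reduces to a per-element agreement tally across systems; a sketch of that comparison is shown below, with fabricated records standing in for the prenatal and L&D systems. The element names and values are invented for illustration.

        # Hypothetical lab values for the same patients as recorded in two perinatal systems.
        prenatal = {"pt1": {"blood_type": "O+", "antibody_screen": "negative"},
                    "pt2": {"blood_type": "A-", "antibody_screen": "positive"}}
        labor_delivery = {"pt1": {"blood_type": "O+", "antibody_screen": "neg"},
                          "pt2": {"blood_type": "A-", "antibody_screen": None}}

        # For each data element, count matching, contradicting, and missing values.
        for element in ["blood_type", "antibody_screen"]:
            match = contradict = missing = 0
            for pt in prenatal:
                v1, v2 = prenatal[pt].get(element), labor_delivery[pt].get(element)
                if v1 is None or v2 is None:
                    missing += 1
                elif v1 == v2:
                    match += 1
                else:
                    contradict += 1  # e.g., "negative" vs "neg": no shared terminology
            total = len(prenatal)
            print(f"{element}: {match}/{total} match, {contradict} contradict, {missing} missing")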

    Clinical foundations and information architecture for the implementation of a federated health record service

    Clinical care increasingly requires healthcare professionals to access patient record information that may be distributed across multiple sites, held in a variety of paper and electronic formats, and represented as mixtures of narrative, structured, coded and multi-media entries. A longitudinal person-centred electronic health record (EHR) is a much-anticipated solution to this problem, but its realisation is proving to be a long and complex journey. This thesis explores the history and evolution of clinical information systems, and establishes a set of clinical and ethico-legal requirements for a generic EHR server. A federated health record (FHR) approach to harmonising distributed heterogeneous electronic clinical databases is advocated as the basis for meeting these requirements. A set of information models and middleware services, needed to implement a Federated Health Record server, are then described, thereby supporting access by clinical applications to a distributed set of feeder systems holding patient record information. The overall information architecture thus defined provides a generic means of combining such feeder system data to create a virtual electronic health record. Active collaboration in a wide range of clinical contexts, across the whole of Europe, has been central to the evolution of the approach taken. A federated health record server based on this architecture has been implemented by the author and colleagues and deployed in a live clinical environment in the Department of Cardiovascular Medicine at the Whittington Hospital in North London. This implementation experience has fed back into the conceptual development of the approach and has provided "proof-of-concept" verification of its completeness and practical utility. This research has benefited from collaboration with a wide range of healthcare sites, informatics organisations and industry across Europe through several EU Health Telematics projects: GEHR, Synapses, EHCR-SupA, SynEx, Medicate and 6WINIT. The information models published here have been placed in the public domain and have substantially contributed to two generations of CEN health informatics standards, including CEN TC/251 ENV 13606.
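    Although the thesis defines this architecture through formal information models rather than code, the core idea of a federation service, querying heterogeneous feeder systems and merging their results into one virtual person-centred record, can be sketched as follows. The class and method names are invented for illustration and are not the thesis's actual middleware interfaces.

        from typing import Protocol

        class FeederSystem(Protocol):
            """Any departmental system able to return record entries for a patient."""
            def entries_for(self, patient_id: str) -> list[dict]: ...

        class CardiologyDB:
            def entries_for(self, patient_id: str) -> list[dict]:
                return [{"source": "cardiology", "type": "ECG report", "patient": patient_id}]

        class LabSystem:
            def entries_for(self, patient_id: str) -> list[dict]:
                return [{"source": "lab", "type": "Troponin result", "patient": patient_id}]

        class FederatedRecordServer:
            """Middleware that presents distributed feeder data as one virtual EHR."""
            def __init__(self, feeders: list[FeederSystem]):
                self.feeders = feeders

            def virtual_record(self, patient_id: str) -> list[dict]:
                # Collect entries from every feeder; a real server would also map each
                # feeder's schema onto the shared record architecture (e.g., ENV 13606).
                record = []
                for feeder in self.feeders:
                    record.extend(feeder.entries_for(patient_id))
                return record

        server = FederatedRecordServer([CardiologyDB(), LabSystem()])
        print(server.virtual_record("patient-42"))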

    Preface


    ACLRO: An Ontology for the Best Practice in ACLR Rehabilitation

    With the rise of big data and the demands for leveraging artificial intelligence (AI), healthcare requires more knowledge sharing that offers machine-readable semantic formalization. Even though some applications allow shared data interoperability, coding systems such as ICD-9/10 and LOINC still lack formal machine-readable semantics. An ontology makes it possible to represent shared conceptualizations, as SNOMED CT does; nevertheless, SNOMED CT mainly focuses on electronic health record (EHR) documentation and evidence-based practice. Moreover, because it is independent of data quality, an ontology enhances advanced AI technologies, such as machine learning (ML), by providing a reusable knowledge framework. Developing a machine-readable and sharable semantic knowledge model incorporating external evidence and an individual practice's values will create a new revolution for best-practice medicine. The purpose of this research is to implement a sharable ontology for best practice in healthcare, with anterior cruciate ligament reconstruction (ACLR) as a case study. The ontology represents knowledge derived from both evidence-based practice (EBP) and practice-based evidence (PBE). First, the study presents how the domain-specific knowledge model is built using a combination of the Toronto Virtual Enterprise (TOVE) methodology and a bottom-up approach. Then, I propose a top-down approach using Open Biological and Biomedical Ontology (OBO) Foundry ontologies that adheres to the Basic Formal Ontology (BFO) framework. In this step, the EBP, PBE, and statistics ontologies are developed independently. Next, the study integrates these individual ontologies into the final ACLR Ontology (ACLRO) as a more meaningful model that supports reusability and eases model expansion, since the classes can grow independently of one another. Finally, the study employs a use case and description logic (DL) queries for model validation. The study's innovation is to present an ontology implementation for best-practice medicine and to demonstrate how it can be applied to a real-world setup with semantic information. The ACLRO simultaneously emphasizes knowledge representation in health intervention, statistics, research design, and external research evidence, while constructing classes for data-driven, patient-focused processes that allow technology-independent knowledge sharing. Additionally, the model synthesizes multiple related ontologies, which leads to the successful application of best-practice medicine.
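    To give a flavour of the validation step, the snippet below uses the rdflib library to pose a SPARQL query, a rough stand-in for a Protégé DL query, against an exported OWL file. The file name, class IRI, and competency question are hypothetical, not the actual ACLRO artifacts.

        from rdflib import Graph
        from rdflib.namespace import RDFS

        # Load the ontology (hypothetical local file exported from Protégé).
        g = Graph()
        g.parse("aclro.owl")

        # Competency-style question: which classes fall under a hypothetical
        # RehabilitationExercise class? (SPARQL approximating a DL subclass query.)
        query = """
            SELECT ?cls ?label WHERE {
                ?cls rdfs:subClassOf* <http://example.org/aclro#RehabilitationExercise> .
                OPTIONAL { ?cls rdfs:label ?label }
            }
        """
        for row in g.query(query, initNs={"rdfs": RDFS}):
            print(row.cls, row.label)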