184 research outputs found

    Challenges and opportunities beyond structured data in analysis of electronic health records

    Electronic health records (EHRs) contain a wealth of valuable information about individual patients and the population as a whole. Beyond structured data, unstructured data in EHRs can provide additional valuable information, but the analytic processes are complex, time-consuming, and often require excessive manual effort. Among unstructured data, clinical text and images are the two most common and important sources of information. Advanced statistical algorithms in natural language processing, machine learning, deep learning, and radiomics have increasingly been used for analyzing clinical text and images. Although many challenges remain that can hinder the use of unstructured data, there are clear opportunities for well-designed diagnosis and decision support tools that efficiently incorporate both structured and unstructured data to extract useful information and deliver better outcomes. However, access to clinical data is still heavily restricted due to data sensitivity and ethical issues. Data quality is another important challenge, and methods for improving data completeness, conformity, and plausibility are needed. Further, generalizing and explaining the results of machine learning models remain open challenges for healthcare. A possible way to improve the quality and accessibility of unstructured data is to develop machine learning methods that can generate clinically relevant synthetic data, and to accelerate research on privacy-preserving techniques such as deidentification and pseudonymization of clinical text.
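As an illustration of the pseudonymization technique this abstract mentions, the following is a minimal sketch: regex patterns stand in for a real clinical NER model, and the identifier categories, salt, and note text are illustrative assumptions, not material from the paper. The key idea shown is that salted hashing gives each identifier a stable surrogate, so the same MRN always maps to the same token across a record.

```python
import hashlib
import re

# Illustrative patterns only; a real system would use trained NER models
# and cover the full HIPAA identifier list.
PATTERNS = {
    "MRN": re.compile(r"\bMRN[:\s]*\d{6,}\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}

def pseudonymize(text: str, salt: str = "demo-salt") -> str:
    """Replace matched identifiers with stable, salted pseudonyms so the
    same identifier always maps to the same surrogate token."""
    def surrogate(kind: str, value: str) -> str:
        digest = hashlib.sha256((salt + value).encode()).hexdigest()[:8]
        return f"[{kind}-{digest}]"
    for kind, pattern in PATTERNS.items():
        text = pattern.sub(lambda m, k=kind: surrogate(k, m.group()), text)
    return text

note = "Seen 03/14/2021, MRN: 1234567, call 555-867-5309."
print(pseudonymize(note))
```

Unlike plain redaction, this keeps co-reference across a document (useful for downstream NLP), at the cost that the salt must be protected like a key.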

    Towards the development of data governance standards for using clinical free-text data in health research: a position paper

    Background: Free-text clinical data (such as outpatient letters or nursing notes) represent a vast, untapped source of rich information that, if more accessible for research, would clarify and supplement information coded in structured data fields. Data usually need to be de-identified or anonymised before they can be reused for research, but there is a lack of established guidelines to govern effective de-identification and use of free-text information and avoid damaging data utility as a by-product. Objective: We set out to work towards data governance standards to integrate with existing frameworks for personal data use, to enable free-text data to be used safely for research for patient/public benefit. Methods: We outlined (UK) data protection legislation and regulations for context, and conducted a rapid literature review and UK-based case studies to explore data governance models used in working with free-text data. We also engaged with stakeholders including text mining researchers and the general public to explore perceived barriers and solutions in working with clinical free-text. Results: We propose a set of recommendations, including the need: for authoritative guidance on data governance for the reuse of free-text data; to ensure public transparency in data flows and uses; to treat de-identified free-text as potentially identifiable with use limited to accredited data safe-havens; and, to commit to a culture of continuous improvement to understand the relationships between efficacy of de-identification and re-identification risks, so this can be communicated to all stakeholders. Conclusions: By drawing together the findings of a combination of activities, our unique study has added new knowledge towards the development of data governance standards for the reuse of clinical free-text data for secondary purposes. Whilst working in accord with existing data governance frameworks, there is a need for further work to take forward the recommendations we have proposed, with commitment and investment, to assure and expand the safe reuse of clinical free-text data for public benefit.

    Understanding Views Around the Creation of a Consented, Donated Databank of Clinical Free Text to Develop and Train Natural Language Processing Models for Research: Focus Group Interviews With Stakeholders

    BACKGROUND: Information stored within electronic health records is often recorded as unstructured text. Special computerized natural language processing (NLP) tools are needed to process this text; however, complex governance arrangements make such data in the National Health Service hard to access, and therefore, it is difficult to use for research in improving NLP methods. The creation of a donated databank of clinical free text could provide an important opportunity for researchers to develop NLP methods and tools and may circumvent delays in accessing the data needed to train the models. However, to date, there has been little or no engagement with stakeholders on the acceptability and design considerations of establishing a free-text databank for this purpose. OBJECTIVE: This study aimed to ascertain stakeholder views around the creation of a consented, donated databank of clinical free text to help create, train, and evaluate NLP for clinical research and to inform the potential next steps for adopting a partner-led approach to establish a national, funded databank of free text for use by the research community. METHODS: Web-based in-depth focus group interviews were conducted with 4 stakeholder groups (patients and members of the public, clinicians, information governance leads and research ethics members, and NLP researchers). RESULTS: All stakeholder groups were strongly in favor of the databank and saw great value in creating an environment where NLP tools can be tested and trained to improve their accuracy. Participants highlighted a range of complex issues for consideration as the databank is developed, including communicating the intended purpose, the approach to access and safeguarding the data, who should have access, and how to fund the databank. 
Participants recommended that a small-scale, gradual approach be adopted to start to gather donations and encouraged further engagement with stakeholders to develop a road map and set of standards for the databank. CONCLUSIONS: These findings provide a clear mandate to begin developing the databank and a framework for stakeholder expectations, which we would aim to meet with the databank delivery.

    Personal information privacy: what's next?

    In recent years, user privacy has been a main focus for technology and data-holding companies, due to the global interest in protecting personal information. Regulations like the General Data Protection Regulation (GDPR) set firm laws and penalties around the handling and misuse of user data. These privacy rules apply regardless of the data structure, whether it is structured or unstructured. In this work, we summarize the available algorithms for providing privacy in structured data and analyze the popular tools that handle privacy in textual data, namely medical data. We found that although these tools provide adequate results in terms of de-identifying medical records by removing personal identifiers (HIPAA PHI), they fall short in terms of generalizing to nonmedical fields. In addition, the metrics used to measure the performance of these privacy algorithms do not take into account the differences in significance among identifiers. Finally, we propose the concept of a domain-independent, adaptable system that learns the significance of terms in a given text, in terms of person identifiability and text utility, and is then able to provide metrics that help find a balance between user privacy and data usability.
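The abstract's point that standard metrics treat all identifiers as equally important can be made concrete with a significance-weighted recall. This is a sketch of the general idea only: the weight values and identifier categories below are hypothetical assumptions, not the system the paper proposes.

```python
# Hypothetical identifier weights: a name reveals more about a person
# than a date does, so missing it should cost more in the score.
WEIGHTS = {"NAME": 1.0, "ID": 0.9, "PHONE": 0.7, "DATE": 0.3}

def weighted_recall(gold, predicted):
    """Recall over gold identifiers, weighting each hit or miss by how
    identifying that category of term is."""
    total = sum(WEIGHTS[kind] for kind, _ in gold)
    found = sum(WEIGHTS[kind] for kind, span in gold if (kind, span) in predicted)
    return found / total if total else 1.0

gold = [("NAME", "John Doe"), ("DATE", "03/14/2021"), ("PHONE", "555-867-5309")]
pred = {("DATE", "03/14/2021"), ("PHONE", "555-867-5309")}
print(round(weighted_recall(gold, pred), 3))  # → 0.5
```

Under plain recall the tool above would score 2/3; weighting by identifiability drops it to 0.5 because the most revealing identifier (the name) was missed.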

    Privacy-Preserving Predictive Modeling: Harmonization of Contextual Embeddings From Different Sources

    Background: Data sharing has been a big challenge in biomedical informatics because of privacy concerns. Contextual embedding models have demonstrated a very strong representative capability to describe medical concepts (and their context), and they have shown promise as an alternative way to support deep-learning applications without the need to disclose original data. However, contextual embedding models acquired from individual hospitals cannot be directly combined because their embedding spaces are different, and naive pooling renders combined embeddings useless. Objective: The aim of this study was to present a novel approach to address these issues and to promote sharing representation without sharing data. Without sacrificing privacy, we also aimed to build a global model from representations learned from local private data and synchronize information from multiple sources. Methods: We propose a methodology that harmonizes different local contextual embeddings into a global model. We used Word2Vec to generate contextual embeddings from each source and Procrustes to fuse different vector models into one common space by using a list of corresponding pairs as anchor points. We performed prediction analysis with harmonized embeddings. Results: We used sequential medical events extracted from the Medical Information Mart for Intensive Care III database to evaluate the proposed methodology in predicting the next likely diagnosis of a new patient using either structured data or unstructured data. Under different experimental scenarios, we confirmed that the global model built from harmonized local models achieves a more accurate prediction than local models and global models built from naive pooling. Conclusions: Such aggregation of local models using our unique harmonization can serve as the proxy for a global model, combining information from a wide range of institutions and information sources. 
It allows information unique to a certain hospital to become available to other sites, increasing the fluidity of information flow in health care.
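The harmonization step this abstract describes, fusing embedding spaces via Procrustes over a list of anchor pairs, can be sketched with the classic orthogonal Procrustes solution. The toy data below (random vectors related by a known rotation) is an illustrative assumption, not the study's MIMIC-III embeddings.

```python
import numpy as np

def procrustes_align(source, target):
    """Return the orthogonal matrix R minimizing ||source @ R - target||_F.

    source, target: (n_anchors, dim) arrays holding embeddings of the
    same anchor terms in two different embedding spaces.
    """
    # Orthogonal Procrustes: SVD of the cross-covariance of the anchors.
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt

# Toy example: the "local" space is an exact rotation of the "global" one.
rng = np.random.default_rng(0)
global_anchors = rng.normal(size=(10, 4))
true_rot, _ = np.linalg.qr(rng.normal(size=(4, 4)))
local_anchors = global_anchors @ true_rot.T  # same concepts, rotated space

R = procrustes_align(local_anchors, global_anchors)
aligned = local_anchors @ R
print(np.allclose(aligned, global_anchors, atol=1e-6))  # → True
```

Because R is constrained to be orthogonal, alignment preserves distances and angles within the local space, which is what lets independently trained embeddings be pooled without distorting their internal geometry.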

    From Raw Data to FAIR Data: The FAIRification Workflow for Health Research

    Background: FAIR (findability, accessibility, interoperability, and reusability) guiding principles seek the reuse of data and other digital research input, output, and objects (algorithms, tools, and workflows that led to those data), making them findable, accessible, interoperable, and reusable. GO FAIR, a bottom-up, stakeholder-driven and self-governed initiative, defined a seven-step FAIRification process focusing on data, but also indicating the required work for metadata. This FAIRification process aims at addressing the translation of raw datasets into FAIR datasets in a general way, without considering specific requirements and challenges that may arise when dealing with particular types of data. This work was performed in the scope of the FAIR4Health project, which has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement number 824666.

    Toward the development of data governance standards for using clinical free-text data in health research: position paper

    Background: Clinical free-text data (eg, outpatient letters or nursing notes) represent a vast, untapped source of rich information that, if more accessible for research, would clarify and supplement information coded in structured data fields. Data usually need to be deidentified or anonymized before they can be reused for research, but there is a lack of established guidelines to govern effective deidentification and use of free-text information and avoid damaging data utility as a by-product. Objective: This study aimed to develop recommendations for the creation of data governance standards to integrate with existing frameworks for personal data use, to enable free-text data to be used safely for research for patient and public benefit. Methods: We outlined data protection legislation and regulations relating to the United Kingdom for context and conducted a rapid literature review and UK-based case studies to explore data governance models used in working with free-text data. We also engaged with stakeholders, including text-mining researchers and the general public, to explore perceived barriers and solutions in working with clinical free-text. Results: We proposed a set of recommendations, including the need for authoritative guidance on data governance for the reuse of free-text data, to ensure public transparency in data flows and uses, to treat deidentified free-text data as potentially identifiable with use limited to accredited data safe havens, and to commit to a culture of continuous improvement to understand the relationships between the efficacy of deidentification and reidentification risks, so this can be communicated to all stakeholders. Conclusions: By drawing together the findings of a combination of activities, we present a position paper to contribute to the development of data governance standards for the reuse of clinical free-text data for secondary purposes. 
While working in accordance with existing data governance frameworks, there is a need for further work to take forward the recommendations we have proposed, with commitment and investment, to assure and expand the safe reuse of clinical free-text data for public benefit.

    Data-Driven and Artificial Intelligence (AI) Approach for Modelling and Analyzing Healthcare Security Practice: A Systematic Review

    Data breaches in healthcare continue to grow exponentially, calling for a rethinking of security measures to better mitigate the threat. Traditional approaches, including technological measures, have significantly contributed to mitigating data breaches, but what is still lacking is the development of the "human firewall": the conscious, careful security practices of insiders. As a result, the healthcare security practice analysis, modeling and incentivization project (HSPAMI) is geared towards analyzing healthcare staff's security practices in various scenarios, including big data. The intention is to determine the gap between staff's actual security practices and the required security practices, to inform incentivization measures. To assess the state of the art, a systematic review was conducted to pinpoint appropriate AI methods and data sources that can be used for effective studies. Out of about 130 articles initially identified in the context of human-generated healthcare data for security measures in healthcare, 15 articles met the inclusion and exclusion criteria. A thorough assessment and analysis of the included articles reveals that KNN, Bayesian network, and decision tree (C4.5) algorithms were mostly applied to electronic health record (EHR) logs and network logs, with varying input features describing healthcare staff's security practices. A remaining challenge is that the performance scores of these algorithms were not sufficiently reported in the existing studies.
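To make concrete how an algorithm like KNN can classify security practice from log-derived features, here is a minimal from-scratch sketch. The features (records viewed per shift, off-hours logins), labels, and thresholds are entirely hypothetical; the reviewed studies' feature sets are not reproduced here.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify a feature vector by majority vote of its k nearest
    labelled neighbours under Euclidean distance."""
    by_dist = sorted(train, key=lambda xy: math.dist(xy[0], query))
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]

# Toy access-log features: (records viewed per shift, off-hours logins),
# with hypothetical labels for "ok" vs "risky" security practice.
train = [
    ((12, 0), "ok"), ((15, 1), "ok"), ((10, 0), "ok"),
    ((80, 6), "risky"), ((95, 4), "risky"), ((70, 5), "risky"),
]
print(knn_predict(train, (85, 5)))  # → risky
print(knn_predict(train, (14, 1)))  # → ok
```

In practice such features would need scaling (a records-viewed count dwarfs a login count), which is one reason the reviewed studies' unreported performance details matter for reproducibility.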

    The 2022 n2c2/UW Shared Task on Extracting Social Determinants of Health

    Objective: The n2c2/UW SDOH Challenge explores the extraction of social determinant of health (SDOH) information from clinical notes. The objectives include the advancement of natural language processing (NLP) information extraction techniques for SDOH and clinical information more broadly. This paper presents the shared task, data, participating teams, performance results, and considerations for future work. Materials and Methods: The task used the Social History Annotated Corpus (SHAC), which consists of clinical text with detailed event-based annotations for SDOH events such as alcohol, drug, tobacco, employment, and living situation. Each SDOH event is characterized through attributes related to status, extent, and temporality. The task includes three subtasks related to information extraction (Subtask A), generalizability (Subtask B), and learning transfer (Subtask C). In addressing this task, participants utilized a range of techniques, including rules, knowledge bases, n-grams, word embeddings, and pretrained language models (LMs). Results: A total of 15 teams participated, and the top teams utilized pretrained deep learning LMs. The top team across all subtasks used a sequence-to-sequence approach, achieving 0.901 F1 for Subtask A, 0.774 F1 for Subtask B, and 0.889 F1 for Subtask C. Conclusions: Similar to many NLP tasks and domains, pretrained LMs yielded the best performance, including for generalizability and learning transfer. An error analysis indicates extraction performance varies by SDOH, with lower performance achieved for conditions, like substance use and homelessness, that increase health risks (risk factors) and higher performance achieved for conditions, like substance abstinence and living with family, that reduce health risks (protective factors).
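The F1 scores reported above can be understood as micro-averaged F1 over sets of extracted items. The sketch below illustrates the metric on made-up (event, attribute) tuples; the exact scoring granularity of the shared task (trigger spans, argument roles) is more detailed than this toy version.

```python
def f1(gold, predicted):
    """Micro-averaged F1 over sets of extracted (event, attribute) tuples."""
    if not gold or not predicted:
        return 0.0
    tp = len(gold & predicted)  # tuples extracted exactly as annotated
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall) if tp else 0.0

# Hypothetical gold annotations and system output for one note.
gold = {("Tobacco", "status:current"), ("Alcohol", "status:none"),
        ("Employment", "status:employed")}
pred = {("Tobacco", "status:current"), ("Alcohol", "status:none"),
        ("Drug", "status:none")}
print(round(f1(gold, pred), 3))  # → 0.667
```

Here the system found 2 of 3 gold tuples and made 1 spurious extraction, so precision and recall are both 2/3, as is the F1.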
