Comparing Rule-based, Feature-based and Deep Neural Methods for De-identification of Dutch Medical Records
Unstructured information in electronic health records provides an invaluable
resource for medical research. To protect the confidentiality of patients and
to conform to privacy regulations, de-identification methods automatically
remove personally identifying information from these medical records. However,
due to the unavailability of labeled data, most existing research is
constrained to English medical text and little is known about the
generalizability of de-identification methods across languages and domains. In
this study, we construct a varied dataset consisting of the medical records of
1260 patients by sampling data from 9 institutes and three domains of Dutch
healthcare. We test the generalizability of three de-identification methods
across languages and domains. Our experiments show that an existing rule-based
method specifically developed for the Dutch language fails to generalize to
this new data. Furthermore, a state-of-the-art neural architecture performs
strongly across languages and domains, even with limited training data.
Compared to feature-based and rule-based methods, the neural method requires
significantly less configuration effort and domain knowledge. We make all code
and pre-trained de-identification models available to the research community,
allowing practitioners to apply them to their datasets and enabling future
benchmarks.
Comment: Proceedings of the 1st ACM WSDM Health Search and Data Mining
Workshop (HSDM2020), 202
Safeguarding Privacy Through Deep Learning Techniques
Over the last few years, there has been a growing need to meet minimum security and privacy requirements. Both public and private companies have had to comply with increasingly stringent standards, such as the ISO 27000 family of standards, and with the various laws governing the management of personal data. The sheer volume of data to be managed has demanded a great effort from employees who, in the absence of automated techniques, have had to work tirelessly to achieve certification objectives. Unfortunately, because the documentation involved contains sensitive information, it is difficult, if not impossible, to obtain material for research and study on which to test new ideas and techniques for automating these processes, for example by exploiting current developments in the scientific community in the fields of ontologies and artificial intelligence for data management. To work around this problem, it was decided to examine data from the medical world, which, largely for reasons related to the health of individuals, have gradually become more freely accessible over time. This choice does not affect the generality of the proposed methods, which can be reapplied in the many fields that need to manage privacy-sensitive information.
Use of natural language processing in psychiatry: A systematic scoping review
Background: The use of artificial intelligence (AI) is receiving ever-increasing attention, including in healthcare. One promising method is natural language processing (NLP), which can be used to analyse written text, for example text in electronic patient records. This study aims to examine research on the use of natural language processing to analyse electronic records of patients with severe mental disorders, such as affective disorders and psychotic disorders. The overall purpose is to gain an impression of whether any of this research focuses on improving patients' health situation.
Material and methods: A systematic scoping review was conducted. The literature search was performed in one database for medical research, PubMed, with the search terms "psychiatry", "electronic medical records" and "natural language processing". The search was not limited in time. To be included, an article had to be empirical, have performed analyses on free-text record data, have used electronic records from psychiatric patients with psychotic disorders and/or affective disorders, and be written in English.
Results: The literature search yielded a total of 211 unique articles, of which 37 met the inclusion criteria of the scoping review and were examined further. Most of the studies were conducted in the United Kingdom and the USA. The size of the study population varied widely, from a few hundred to several hundred thousand included patients. Little of the research addressed specific document types from the patient record, such as discharge summaries or admission notes. The aims of the studies varied widely but could be grouped into some common categories: 1) identification of information from records, 2) quantitative investigations of the population or the records, 3) selection of patients for cohorts, and 4) risk assessment.
Interpretation: More basic research is needed before natural language processing technology for analysing electronic records will contribute to improving the health situation of psychiatric patients.
Novel reversible text data de-identification techniques based on native data structures
Technological development in today's digital world has resulted in the collection and storage of large amounts of personal data. These data enable both direct services and non-direct activities, known as secondary use. The secondary use of data can improve decision-making, service experiences, and healthcare systems. However, the widespread reuse of personal data raises significant privacy and policy issues, especially for health-related information; such data may contain sensitive details, leading to privacy breaches if compromised. Legal systems establish laws to protect the privacy of personal data disclosed for secondary use. A well-known example is the General Data Protection Regulation (GDPR), which outlines a specific set of rules for sharing and storing personal data to protect individual privacy. The GDPR explicitly points to data de-identification, especially pseudonymization, as one measure that can help meet the requirements for the processing of personal data.
The literature on privacy preservation approaches has largely been developed in the field of data anonymization, where personal data are irreversibly removed or obfuscated and there is no means by which to recover an individual's identity if needed. By contrast, pseudonymization is a promising technique to protect privacy while enabling the recovery of de-identified data. Significantly, many existing approaches for pseudonymization were developed long before the GDPR requirements were established, and so they may fail to satisfy its provisions. Therefore, it is worthwhile to offer technical solutions to preserve privacy while supporting the legitimate use of data.
This thesis proposes a novel de-identification system for unstructured textual data, known as ARTPHIL, that generates de-identified data in compliance with the GDPR requirement for strong pseudonymization. The system was evaluated using 2014 i2b2 testing data. The proposed system achieved a recall of 96.93% in terms of detecting and encrypting personal health information, as specified under guidelines provided by the Health Insurance Portability and Accountability Act (HIPAA). The system used a novel, lightweight cryptographic algorithm, E-ART, to encrypt personal data cost-effectively and without compromising security. The main novelty of the E-ART algorithm is the use of the reflection property of a balanced binary tree data structure as a substitution method instead of complex and multiple iterations. The performance and security of the proposed algorithm were compared to two symmetric encryption algorithms: the Advanced Encryption Standard (AES) and the Data Encryption Standard (DES). The security analysis showed comparable results, but the performance analysis indicated that E-ART had the shortest ciphertext and running time with comparable memory usage, which indicates the feasibility of using ARTPHIL for delay-sensitive or data-intensive applications.
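The recall figure quoted above follows the usual definition: the share of true personal-health-information items that the system actually detected. The sketch below only illustrates that formula; the function name `recall` and the counts in the example are hypothetical and are not taken from the thesis or from the i2b2 data.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Return recall = TP / (TP + FN).

    true_positives: sensitive items the system detected.
    false_negatives: sensitive items the system missed.
    """
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("recall is undefined when there are no positive items")
    return true_positives / total


# Hypothetical counts, chosen only to illustrate the arithmetic.
example = recall(true_positives=97, false_negatives=3)
print(f"recall = {example:.2%}")
```

For de-identification, recall is usually the headline metric because a missed identifier is a privacy leak, whereas a false alarm only removes some non-sensitive text.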
Using machine learning for automated de-identification and clinical coding of free text data in electronic medical records
The widespread adoption of Electronic Medical Records (EMRs) in hospitals continues to increase the amount of patient data that are digitally stored. Although the primary use of the EMR is to support patient care by making all relevant information accessible, governments and health organisations are looking for ways to unleash the potential of these data for secondary purposes, including clinical research, disease surveillance and automation of healthcare processes and workflows.
EMRs include large quantities of free text documents that contain valuable information. The greatest challenges in using the free text data in EMRs include the removal of personally identifiable information and the extraction of relevant information for specific tasks such as clinical coding. Machine learning-based automated approaches can potentially address these challenges.
This thesis aims to explore and improve the performance of machine learning models for automated de-identification and clinical coding of free text data in EMRs, as captured in hospital discharge summaries, and facilitate the applications of these approaches in real-world use cases. It does so by 1) implementing an end-to-end de-identification framework using an ensemble of deep learning models; 2) developing a web-based system for de-identification of free text (DEFT) with an interactive learning loop; 3) proposing and implementing a hierarchical label-wise attention transformer model (HiLAT) for explainable International Classification of Diseases (ICD) coding; and 4) investigating the use of extreme multi-label long text transformer-based models for automated ICD coding.
The key findings include: 1) An end-to-end framework using an ensemble of deep learning base-models achieved excellent performance on the de-identification task. 2) A new web-based de-identification software system (DEFT) can be readily and easily adopted by data custodians and researchers to perform de-identification of free text in EMRs. 3) A novel domain-specific transformer-based model (HiLAT) achieved state-of-the-art (SOTA) results for predicting ICD codes on a Medical Information Mart for Intensive Care (MIMIC-III) dataset comprising the discharge summaries (n=12,808) that are coded with at least one of the 50 most frequent diagnosis and procedure codes. In addition, the label-wise attention scores for the tokens in the discharge summary presented a potential explainability tool for checking the face validity of ICD code predictions. 4) An optimised transformer-based model, PLM-ICD, achieved the latest SOTA results for ICD coding on all the discharge summaries of the MIMIC-III dataset (n=59,652). The segmentation method, which split the long text consecutively into multiple small chunks, addressed the problem of applying transformer-based models to long text datasets. However, using transformer-based models on extremely large label sets needs further research.
These findings demonstrate that the de-identification and clinical coding tasks can benefit from the application of machine learning approaches, present practical tools for implementing these approaches, and highlight priorities for further research.
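The segmentation idea in finding 4, splitting a long text consecutively into multiple small chunks, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: it splits on whitespace rather than with a model tokenizer, and the function name `split_into_chunks` and the chunk size are hypothetical.

```python
from typing import List


def split_into_chunks(tokens: List[str], chunk_size: int) -> List[List[str]]:
    """Split a token sequence into consecutive, non-overlapping chunks.

    Every chunk holds at most `chunk_size` tokens; the last one may be shorter.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


# Whitespace splitting stands in for a real tokenizer (an assumption).
text = "patient admitted with chest pain and discharged after two days"
tokens = text.split()
chunks = split_into_chunks(tokens, chunk_size=4)
for chunk in chunks:
    print(chunk)
```

Each chunk can then be encoded separately by a model with a fixed input length, with the per-chunk outputs combined afterwards; how that combination is done is outside this sketch.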