9 research outputs found

    Comparing Rule-based, Feature-based and Deep Neural Methods for De-identification of Dutch Medical Records

    Unstructured information in electronic health records provides an invaluable resource for medical research. To protect the confidentiality of patients and to conform to privacy regulations, de-identification methods automatically remove personally identifying information from these medical records. However, due to the unavailability of labeled data, most existing research is constrained to English medical text, and little is known about the generalizability of de-identification methods across languages and domains. In this study, we construct a varied dataset consisting of the medical records of 1260 patients by sampling data from nine institutes and three domains of Dutch healthcare. We test the generalizability of three de-identification methods across languages and domains. Our experiments show that an existing rule-based method specifically developed for the Dutch language fails to generalize to this new data. Furthermore, a state-of-the-art neural architecture performs strongly across languages and domains, even with limited training data. Compared to feature-based and rule-based methods, the neural method requires significantly less configuration effort and domain knowledge. We make all code and pre-trained de-identification models available to the research community, allowing practitioners to apply them to their datasets and enabling future benchmarks. Comment: Proceedings of the 1st ACM WSDM Health Search and Data Mining Workshop (HSDM2020), 202

    Safeguarding Privacy Through Deep Learning Techniques

    Over the last few years, there has been a growing need to meet minimum security and privacy requirements. Both public and private companies have had to comply with increasingly stringent standards, such as the ISO 27000 family, and with the various laws governing the management of personal data. The sheer volume of data to be managed has demanded a huge effort from employees who, in the absence of automatic techniques, have had to work tirelessly to meet certification objectives. Unfortunately, because the documentation involved contains sensitive information, it is difficult if not impossible to obtain material for research and study purposes on which to experiment with new ideas and techniques for automating these processes, for instance by exploiting current work in the scientific community on ontologies and artificial intelligence for data management. To work around this problem, this work examines data from the medical domain, which, largely for important reasons related to the health of individuals, have gradually become more freely accessible over time. This choice does not affect the generality of the proposed methods, which can be reapplied to any field that needs to manage privacy-sensitive information.

    Use of natural language processing in psychiatry: A systematic scoping review

    Background: The use of artificial intelligence (AI) is receiving steadily increasing attention, including in healthcare. One promising method is natural language processing (NLP), which can be used to analyze written text, for example text in electronic patient records. This study aims to examine research on the use of natural language processing to analyze electronic records from patients with severe mental disorders, such as affective disorders and psychotic disorders. The overall purpose is to gain an impression of whether any of this research focuses on improving patients' health situation. Materials and methods: A systematic scoping review was conducted. The literature search was carried out in one database for medical research, PubMed, using the search terms "psychiatry", "electronic medical records" and "natural language processing". The search was not restricted in time. To be included, an article had to be empirical, have performed analyses on free-text record data, have used electronic records from psychiatric patients with psychotic disorders and/or affective disorders, and be written in English. Results: The literature search yielded a total of 211 unique articles, of which 37 met the inclusion criteria of the scoping review and were examined further. Most of the studies were conducted in the United Kingdom and the USA. The size of the study population varied widely, from a few hundred to several hundred thousand included patients. Little of the research addressed specific document types from the patient record, such as discharge summaries or admission notes. The purpose of the studies varied widely but could be grouped into some common categories: 1) identification of information from the record, 2) quantitative investigations of the population or the records, 3) selection of patients for cohorts and 4) risk assessment. Interpretation: More basic research is needed before natural language processing technology for analyzing electronic records will contribute to improving the health situation of psychiatric patients.

    Using machine learning for automated de-identification and clinical coding of free text data in electronic medical records

    The widespread adoption of Electronic Medical Records (EMRs) in hospitals continues to increase the amount of patient data that are digitally stored. Although the primary use of the EMR is to support patient care by making all relevant information accessible, governments and health organisations are looking for ways to unleash the potential of these data for secondary purposes, including clinical research, disease surveillance and automation of healthcare processes and workflows. EMRs include large quantities of free text documents that contain valuable information. The greatest challenges in using the free text data in EMRs include the removal of personally identifiable information and the extraction of relevant information for specific tasks such as clinical coding. Machine learning-based automated approaches can potentially address these challenges. This thesis aims to explore and improve the performance of machine learning models for automated de-identification and clinical coding of free text data in EMRs, as captured in hospital discharge summaries, and to facilitate the application of these approaches in real-world use cases. It does so by 1) implementing an end-to-end de-identification framework using an ensemble of deep learning models; 2) developing a web-based system for de-identification of free text (DEFT) with an interactive learning loop; 3) proposing and implementing a hierarchical label-wise attention transformer model (HiLAT) for explainable International Classification of Diseases (ICD) coding; and 4) investigating the use of extreme multi-label long text transformer-based models for automated ICD coding. The key findings include: 1) An end-to-end framework using an ensemble of deep learning base-models achieved excellent performance on the de-identification task. 2) A new web-based de-identification software system (DEFT) can be readily adopted by data custodians and researchers to perform de-identification of free text in EMRs. 3) A novel domain-specific transformer-based model (HiLAT) achieved state-of-the-art (SOTA) results for predicting ICD codes on a Medical Information Mart for Intensive Care (MIMIC-III) dataset comprising the discharge summaries (n=12,808) that are coded with at least one of the 50 most frequent diagnosis and procedure codes. In addition, the label-wise attention scores for the tokens in the discharge summary presented a potential explainability tool for checking the face validity of ICD code predictions. 4) An optimised transformer-based model, PLM-ICD, achieved the latest SOTA results for ICD coding on all the discharge summaries of the MIMIC-III dataset (n=59,652). The segmentation method, which split the long text consecutively into multiple small chunks, addressed the problem of applying transformer-based models to long text datasets. However, using transformer-based models on extremely large label sets needs further research. These findings demonstrate that the de-identification and clinical coding tasks can benefit from the application of machine learning approaches, present practical tools for implementing these approaches, and highlight priorities for further research.
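    The segmentation idea mentioned in this abstract, splitting a long text consecutively into multiple small chunks, can be illustrated with a minimal sketch. The function name `split_into_chunks`, the whitespace tokenization, and the chunk size of 4 are assumptions chosen for illustration; the abstract does not specify the tokenizer or chunk length used in the thesis.

```python
from typing import List


def split_into_chunks(tokens: List[str], chunk_size: int) -> List[List[str]]:
    """Split a token sequence into consecutive, non-overlapping chunks.

    Hypothetical helper: every chunk has `chunk_size` tokens except
    possibly the last one, which holds the remainder.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


# Toy example with whitespace tokens (an assumption; a real pipeline
# would use the model's own subword tokenizer).
tokens = "patient admitted with chest pain and discharged after two days".split()
chunks = split_into_chunks(tokens, chunk_size=4)
for chunk in chunks:
    print(chunk)
```

    Each chunk can then be encoded separately by a model with a limited input length, and the per-chunk representations combined afterwards; how they are combined is not described in the abstract.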