3 research outputs found

    In the Name of Fairness: Assessing the Bias in Clinical Record De-identification

    Full text link
    Data sharing is crucial for open science and reproducible research, but the legal sharing of clinical data requires the removal of protected health information from electronic health records. This process, known as de-identification, is often achieved through the use of machine learning algorithms by many commercial and open-source systems. While these systems have shown compelling results on average, the variation in their performance across different demographic groups has not been thoroughly examined. In this work, we investigate the bias of de-identification systems on names in clinical notes via a large-scale empirical analysis. To achieve this, we create 16 name sets that vary along four demographic dimensions: gender, race, name popularity, and the decade of popularity. We insert these names into 100 manually curated clinical templates and evaluate the performance of nine public and private de-identification methods. Our findings reveal that there are statistically significant performance gaps along a majority of the demographic dimensions in most methods. We further illustrate that de-identification quality is affected by polysemy in names, gender context, and clinical note characteristics. To mitigate the identified gaps, we propose a simple and method-agnostic solution by fine-tuning de-identification methods with clinical context and diverse names. Overall, it is imperative to address the bias in existing methods immediately so that downstream stakeholders can build high-quality systems to serve all demographic parties fairly.Comment: Accepted by FAccT 2023; updated appendix with the de-identification performance of GPT-

    HIPAAway: developing software for de-identification and exploring bias in name detection

    No full text
    De-identification, the process of removing identifiers, is a crucial step in the preparation of clinical data for use in biomedical research. Advances in natural language processing have increased interest in developing an accurate and adaptable automatic de-identification system for clinical text. Models for de-identification have been found successful but are largely unavailable for public use due to a lack of provided code and a cost associated with using commercial models. A lack of transparency in deidentification model training may bias the models against certain demographic groups, which are hidden in overall performance metrics and need to be evaluated due to the disproportionate potential harm to marginalized communities. In this thesis, we review current de-identification methods, present a new de-identification dataset, audit demographic biases in existing de-identification approaches, and develop an easy-to-use, open-source de-identification software package. This package would make clinical text de-identification more accessible to researchers and clinicians, alleviating the bottleneck of de-identification to free up more data for biomedical research. This would help make future research more robust and beneficial to not only the medical community, but also people around the world.M.Eng

    In the Name of Fairness: Assessing the Bias in Clinical Record De-identification

    No full text
    corecore