222 research outputs found

    In the Name of Fairness: Assessing the Bias in Clinical Record De-identification

    Data sharing is crucial for open science and reproducible research, but the legal sharing of clinical data requires the removal of protected health information from electronic health records. This process, known as de-identification, is often achieved through the use of machine learning algorithms by many commercial and open-source systems. While these systems have shown compelling results on average, the variation in their performance across different demographic groups has not been thoroughly examined. In this work, we investigate the bias of de-identification systems on names in clinical notes via a large-scale empirical analysis. To achieve this, we create 16 name sets that vary along four demographic dimensions: gender, race, name popularity, and the decade of popularity. We insert these names into 100 manually curated clinical templates and evaluate the performance of nine public and private de-identification methods. Our findings reveal that there are statistically significant performance gaps along a majority of the demographic dimensions in most methods. We further illustrate that de-identification quality is affected by polysemy in names, gender context, and clinical note characteristics. To mitigate the identified gaps, we propose a simple and method-agnostic solution by fine-tuning de-identification methods with clinical context and diverse names. Overall, it is imperative to address the bias in existing methods immediately so that downstream stakeholders can build high-quality systems to serve all demographic parties fairly. Comment: Accepted by FAccT 2023; updated appendix with the de-identification performance of GPT-

    Using data-driven sublanguage pattern mining to induce knowledge models: application in medical image reports knowledge representation

    Background: The use of knowledge models facilitates information retrieval, knowledge base development, and therefore supports new knowledge discovery that ultimately enables decision support applications. Most existing works have employed machine learning techniques to construct a knowledge base. However, they often suffer from low precision in extracting entities and relationships. In this paper, we describe a data-driven sublanguage pattern mining method that can be used to create a knowledge model. We combined natural language processing (NLP) and semantic network analysis in our model generation pipeline. Methods: As a use case of our pipeline, we utilized data from an open source imaging case repository, Radiopaedia.org, to generate a knowledge model that represents the contents of medical imaging reports. We extracted entities and relationships using the Stanford part-of-speech parser and the “Subject:Relationship:Object” syntactic data schema. The identified noun phrases were tagged with the Unified Medical Language System (UMLS) semantic types. An evaluation was done on a dataset comprising 83 image notes from four data sources. Results: A semantic type network was built based on the co-occurrence of 135 UMLS semantic types in 23,410 medical image reports. By regrouping the semantic types and generalizing the semantic network, we created a knowledge model that contains 14 semantic categories. Our knowledge model was able to cover 98% of the content in the evaluation corpus and revealed 97% of the relationships. Machine annotation achieved a precision of 87%, recall of 79%, and F-score of 82%. Conclusion: The results indicated that our pipeline was able to produce a comprehensive content-based knowledge model that could represent context from various sources in the same domain.

    Hypersensitivity Adverse Event Reporting in Clinical Cancer Trials: Barriers and Potential Solutions to Studying Severe Events on a Population Level

    Abstract. Hypersensitivity Adverse Event Reporting in Clinical Cancer Trials: Barriers and Potential Solutions to Studying Allergic Events on a Population Level, by Christina Eldredge. The University of Wisconsin-Milwaukee, 2020. Under the supervision of Professor Timothy Patrick. Clinical cancer trial interventions are associated with hypersensitivity events (HEs), which are recorded in the national clinical trial registry, ClinicalTrials.gov, and are publicly available. This data could potentially be leveraged to study predictors for HEs to identify at-risk patients who may benefit from desensitization therapies to prevent these potentially life-threatening reactions. However, variation in investigator reporting methods is a barrier to leveraging this data for aggregation and analysis. The National Cancer Institute has developed the CTCAE classification system to address this barrier. This study analyzes the comprehensiveness of CTCAE to describe severe HEs in clinical cancer trials in comparison to other systems or terminologies. An XML parser was used to extract readable text from adverse event tables. Queries of the parsed data elements were performed to identify immune disorder events associated with biological and chemotherapy interventions. A data subset of severe anaphylactic and anaphylactoid events was created and analyzed. There were 1,331 clinical trials with 13,088 immune disorder events from September 20, 1999 to March 2018. Of these events, 2,409 (18.4%) were recorded as “serious” events. In the severe subset, MedDRA terminology, CTCAE or CTC classification systems were used to describe HEs; however, a large number of studies did not specify the system. The CTCAE term “anaphylaxis” was miscoded as “other (not including serious)” in 76.2% of events. The CTCAE classification system severity grade levels were not used to describe any of the severe events, and the majority of terms did not include the allergen; therefore, in dual or multi-drug therapies, the etiologic agent was not identifiable. Furthermore, collection methods were not specified in 76% of events. Therefore, CTCAE was not found to improve the ability to capture event etiology or severity in anaphylaxis and anaphylactoid events in cancer clinical trials. Potential solutions to improving CTCAE HE description include adapting terms with a low percentage of HE severity miscoding (e.g. anaphylactic reaction) and terms which include drugs, biological agents and/or drug classes to improve study of anaphylaxis etiology and incidence in multi-drug cancer therapy, thereby making a significant impact on patient safety.

    Automated Detection of Substance-Use Status and Related Information from Clinical Text

    This study aims to develop and evaluate an automated system for extracting information related to patient substance use (smoking, alcohol, and drugs) from unstructured clinical text (medical discharge records). The authors propose a four-stage system for the extraction of the substance-use status and related attributes (type, frequency, amount, quit-time, and period). The first stage uses a keyword search technique to detect sentences related to substance use and to exclude unrelated records. In the second stage, an extension of the NegEx negation detection algorithm is developed and employed for detecting the negated records. The third stage involves identifying the temporal status of the substance use by applying windowing and chunking methodologies. Finally, in the fourth stage, regular expressions, syntactic patterns, and keyword search techniques are used in order to extract the substance-use attributes. The proposed system achieves an F1-score of up to 0.99 for identifying substance-use-related records, 0.98 for detecting the negation status, and 0.94 for identifying temporal status. Moreover, F1-scores of up to 0.98, 0.98, 1.00, 0.92, and 0.98 are achieved for the extraction of the amount, frequency, type, quit-time, and period attributes, respectively. Natural Language Processing (NLP) and rule-based techniques are employed efficiently for extracting substance-use status and attributes, with the proposed system being able to detect substance-use status and attributes over both sentence-level and document-level data. Results show that the proposed system outperforms the compared state-of-the-art substance-use identification system on an unseen dataset, demonstrating its generalisability
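    The first two stages described in this abstract (keyword filtering, then negation detection) can be illustrated with a minimal Python sketch. It is not the authors' system or their NegEx extension: the stem lists, the trigger list, the window size, and the function name `classify` are all hypothetical choices made for illustration, and the temporal-status and attribute-extraction stages are omitted.

```python
import re

# Hypothetical keyword stems per substance type (illustrative, not from the paper).
STEMS = {
    "smoking": ["smok", "tobacco", "cigarette"],
    "alcohol": ["alcohol", "etoh", "drink"],
    "drugs": ["cocaine", "heroin", "marijuana"],
}

# Hypothetical single-token negation triggers, in the spirit of NegEx.
NEGATION_TRIGGERS = {"no", "not", "denies", "denied", "never", "without"}

# Assumed number of tokens before the keyword that are searched for a trigger.
WINDOW = 5


def classify(sentence):
    """Return (substance, negated) for the first keyword hit, or None.

    Stage 1: keyword search decides whether the sentence concerns substance use.
    Stage 2: a trigger word within WINDOW tokens before the keyword marks negation.
    """
    tokens = re.findall(r"[a-z]+", sentence.lower())
    for i, token in enumerate(tokens):
        for substance, stems in STEMS.items():
            if any(token.startswith(stem) for stem in stems):
                preceding = tokens[max(0, i - WINDOW):i]
                negated = any(t in NEGATION_TRIGGERS for t in preceding)
                return substance, negated
    return None  # sentence is unrelated to substance use
```

    Under these assumptions, `classify("Patient denies smoking.")` flags a negated smoking mention, while a sentence with no keyword hit is excluded at the first stage.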

    Barry Smith an sich

    Festschrift in Honor of Barry Smith on the occasion of his 65th Birthday. Published as issue 4:4 of the journal Cosmos + Taxis: Studies in Emergent Order and Organization. Includes contributions by Wolfgang Grassl, Nicola Guarino, John T. Kearns, Rudolf Lüthe, Luc Schneider, Peter Simons, Wojciech Żełaniec, and Jan Woleński

    Spanish named entity recognition in the biomedical domain

    Named Entity Recognition in the clinical domain and in languages other than English is hampered by the absence of complete dictionaries, the informality of texts, the polysemy of terms, disagreement over the boundaries of an entity, and the scarcity of corpora and other available resources. We present a Named Entity Recognition method for poorly resourced languages. The method was tested with Spanish radiology reports and compared with a conditional random fields system. Peer reviewed. Postprint (author's final draft).

    Publications by Barry Smith


    Private hospital workflow optimization via secure k-means clustering

    Optimizing the workflow of a complex organization such as a hospital is a difficult task. An accurate option is to use a real-time locating system to track locations of both patients and staff. However, privacy regulations forbid hospital management from assessing location data of their staff members. In this exploratory work, we propose a secure solution to analyze the joined location data of patients and staff, by means of an innovative cryptographic technique called Secure Multi-Party Computation, in which an additional entity that the staff members can trust, such as a labour union, takes care of the staff data. The hospital, owning location data of patients, and the labour union perform a two-party protocol, in which they securely cluster the staff members by means of the frequency of their patient-facing times. We describe the secure solution in detail, and evaluate the performance of our proof-of-concept. This work thus demonstrates the feasibility of secure multi-party clustering in this setting.