In the Name of Fairness: Assessing the Bias in Clinical Record De-identification
Data sharing is crucial for open science and reproducible research, but the
legal sharing of clinical data requires the removal of protected health
information from electronic health records. This process, known as
de-identification, is often achieved through the use of machine learning
algorithms by many commercial and open-source systems. While these systems have
shown compelling results on average, the variation in their performance across
different demographic groups has not been thoroughly examined. In this work, we
investigate the bias of de-identification systems on names in clinical notes
via a large-scale empirical analysis. To achieve this, we create 16 name sets
that vary along four demographic dimensions: gender, race, name popularity, and
the decade of popularity. We insert these names into 100 manually curated
clinical templates and evaluate the performance of nine public and private
de-identification methods. Our findings reveal that there are statistically
significant performance gaps along a majority of the demographic dimensions in
most methods. We further illustrate that de-identification quality is affected
by polysemy in names, gender context, and clinical note characteristics. To
mitigate the identified gaps, we propose a simple and method-agnostic solution
by fine-tuning de-identification methods with clinical context and diverse
names. Overall, it is imperative to address the bias in existing methods
immediately so that downstream stakeholders can build high-quality systems to
serve all demographic parties fairly.
Comment: Accepted by FAccT 2023; updated appendix with the de-identification
performance of GPT-
Using data-driven sublanguage pattern mining to induce knowledge models: application in medical image reports knowledge representation
Background: The use of knowledge models facilitates information retrieval and knowledge base development, and therefore supports new knowledge discovery that ultimately enables decision support applications. Most existing works have employed machine learning techniques to construct a knowledge base. However, they often suffer from low precision in extracting entities and relationships. In this paper, we describe a data-driven sublanguage pattern mining method that can be used to create a knowledge model. We combine natural language processing (NLP) and semantic network analysis in our model generation pipeline.
Methods: As a use case of our pipeline, we utilized data from an open source imaging case repository, Radiopaedia.org, to generate a knowledge model that represents the contents of medical imaging reports. We extracted entities and relationships using the Stanford part-of-speech parser and the “Subject:Relationship:Object” syntactic data schema. The identified noun phrases were tagged with the Unified Medical Language System (UMLS) semantic types. An evaluation was done on a dataset comprised of 83 image notes from four data sources.
Results: A semantic type network was built based on the co-occurrence of 135 UMLS semantic types in 23,410 medical image reports. By regrouping the semantic types and generalizing the semantic network, we created a knowledge model that contains 14 semantic categories. Our knowledge model was able to cover 98% of the content in the evaluation corpus and revealed 97% of the relationships. Machine annotation achieved a precision of 87%, recall of 79%, and F-score of 82%.
Conclusion: The results indicated that our pipeline was able to produce a comprehensive content-based knowledge model that could represent context from various sources in the same domain
Hypersensitivity Adverse Event Reporting in Clinical Cancer Trials: Barriers and Potential Solutions to Studying Severe Events on a Population Level
Abstract: Hypersensitivity Adverse Event Reporting in Clinical Cancer Trials: Barriers and Potential Solutions to Studying Allergic Events on a Population Level, by Christina Eldredge. The University of Wisconsin-Milwaukee, 2020. Under the supervision of Professor Timothy Patrick.
Clinical cancer trial interventions are associated with hypersensitivity events (HEs), which are recorded in the national clinical trial registry, ClinicalTrials.gov, and are publicly available. These data could potentially be leveraged to study predictors of HEs and to identify at-risk patients who may benefit from desensitization therapies to prevent these potentially life-threatening reactions. However, variation in investigator reporting methods is a barrier to aggregating and analyzing these data. The National Cancer Institute has developed the CTCAE classification system to address this barrier. This study analyzes how comprehensively CTCAE describes severe HEs in clinical cancer trials in comparison to other systems or terminologies.
An XML parser was used to extract readable text from adverse event tables. Queries of the parsed data elements were performed to identify immune disorder events associated with biological and chemotherapy interventions. A data subset of severe anaphylactic and anaphylactoid events was created and analyzed.
1,331 clinical trials with 13,088 immune disorder events were recorded from September 20, 1999 to March 2018; 2,409 (18.4%) of these events were recorded as “serious”. In the severe subset, the MedDRA terminology or the CTCAE or CTC classification systems were used to describe HEs; however, a large number of studies did not specify the system. The CTCAE term “anaphylaxis” was miscoded as “other (not including serious)” in 76.2% of events. CTCAE severity grade levels were not used to describe any of the severe events, and the majority of terms did not include the allergen; therefore, in dual or multi-drug therapies, the etiologic agent was not identifiable. Furthermore, collection methods were not specified in 76% of events.
Therefore, CTCAE was not found to improve the ability to capture event etiology or severity in anaphylaxis and anaphylactoid events in cancer clinical trials. Potential solutions for improving CTCAE HE description include adapting terms with a low percentage of HE severity miscoding (e.g. anaphylactic reaction) and terms which include drugs, biological agents and/or drug classes, to improve the study of anaphylaxis etiology and incidence in multi-drug cancer therapy, thereby making a significant impact on patient safety.
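The XML extraction step described in this abstract can be illustrated with a minimal sketch. The element names below (`reported_events`, `category`, `sub_title`, `counts`) and the sample record are assumptions made for illustration only; they are a simplified, hypothetical layout, not the registry's actual schema or the author's parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified adverse-event record (not real trial data).
SAMPLE = """<reported_events>
  <serious_events>
    <category>
      <title>Immune system disorders</title>
      <event><sub_title vocab="CTCAE (4.0)">Anaphylaxis</sub_title><counts subjects_affected="2"/></event>
      <event><sub_title>Hypersensitivity</sub_title><counts subjects_affected="1"/></event>
    </category>
    <category>
      <title>Cardiac disorders</title>
      <event><sub_title vocab="MedDRA 12.0">Tachycardia</sub_title><counts subjects_affected="3"/></event>
    </category>
  </serious_events>
</reported_events>"""


def immune_events(xml_text):
    """Return one row per event listed under an immune disorder category."""
    root = ET.fromstring(xml_text)
    rows = []
    for seriousness in ("serious_events", "other_events"):
        for category in root.findall(f"./{seriousness}/category"):
            title = category.findtext("title", default="")
            if "immune" not in title.lower():
                continue  # keep only immune disorder categories
            for event in category.findall("event"):
                sub = event.find("sub_title")
                counts = event.find("counts")
                rows.append({
                    "seriousness": seriousness,
                    "term": sub.text,
                    "vocab": sub.get("vocab"),  # None when no system is specified
                    "affected": int(counts.get("subjects_affected")),
                })
    return rows


rows = immune_events(SAMPLE)
```

Rows whose `vocab` is `None` correspond to the situation the abstract reports, where a study does not state which terminology or classification system was used.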
Automated Detection of Substance-Use Status and Related Information from Clinical Text
This study aims to develop and evaluate an automated system for extracting information related to patient substance use (smoking, alcohol, and drugs) from unstructured clinical text (medical discharge records). The authors propose a four-stage system for the extraction of the substance-use status and related attributes (type, frequency, amount, quit-time, and period). The first stage uses a keyword search technique to detect sentences related to substance use and to exclude unrelated records. In the second stage, an extension of the NegEx negation detection algorithm is developed and employed for detecting the negated records. The third stage involves identifying the temporal status of the substance use by applying windowing and chunking methodologies. Finally, in the fourth stage, regular expressions, syntactic patterns, and keyword search techniques are used in order to extract the substance-use attributes. The proposed system achieves an F1-score of up to 0.99 for identifying substance-use-related records, 0.98 for detecting the negation status, and 0.94 for identifying temporal status. Moreover, F1-scores of up to 0.98, 0.98, 1.00, 0.92, and 0.98 are achieved for the extraction of the amount, frequency, type, quit-time, and period attributes, respectively. Natural Language Processing (NLP) and rule-based techniques are employed efficiently for extracting substance-use status and attributes, with the proposed system being able to detect substance-use status and attributes over both sentence-level and document-level data. Results show that the proposed system outperforms the compared state-of-the-art substance-use identification system on an unseen dataset, demonstrating its generalisability
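The four stages described above can be sketched roughly as follows. This is a simplified illustration, not the authors' system: the keyword lists, negation triggers, temporal cues, and regular expressions are all hypothetical, the negation check is only NegEx-like (a trigger shortly before the substance keyword), and the temporal stage is reduced to a cue lookup rather than windowing and chunking.

```python
import re

# Hypothetical lexicons; the paper's actual keyword lists are not given here.
SUBSTANCE_KEYWORDS = {
    "smoking": ["smok", "tobacco", "cigarette"],
    "alcohol": ["alcohol", "drink", "etoh"],
    "drugs": ["cocaine", "heroin", "marijuana", "illicit drug"],
}
NEGATION_TRIGGERS = ["denies", "no history of", "never", "does not", "negative for"]
PAST_CUES = ["quit", "former", "stopped", "used to"]

AMOUNT_RE = re.compile(r"\d+(?:\.\d+)?\s*(?:packs?|cigarettes?|drinks?|beers?|glasses)", re.I)
FREQ_RE = re.compile(r"\b(?:per|a|each)\s+(?:day|week|month|year)\b|\b(?:daily|weekly)\b", re.I)


def detect_type(sentence):
    """Stage 1: keyword search; returns the substance type or None."""
    lowered = sentence.lower()
    for substance, keywords in SUBSTANCE_KEYWORDS.items():
        if any(k in lowered for k in keywords):
            return substance
    return None


def is_negated(sentence, substance, window=6):
    """Stage 2: a trigger at most `window` tokens before the keyword negates it."""
    lowered = sentence.lower()
    kw_pos = min(lowered.find(k) for k in SUBSTANCE_KEYWORDS[substance] if k in lowered)
    for trigger in NEGATION_TRIGGERS:
        t_pos = lowered.find(trigger)
        if t_pos != -1 and t_pos < kw_pos:
            between = lowered[t_pos + len(trigger):kw_pos].split()
            if len(between) <= window:
                return True
    return False


def extract(sentence):
    """Run all stages on one sentence; returns None for unrelated sentences."""
    substance = detect_type(sentence)
    if substance is None:
        return None
    if is_negated(sentence, substance):
        return {"type": substance, "status": "negated"}
    lowered = sentence.lower()
    # Stage 3 (simplified): past-use cues decide the temporal status.
    status = "past" if any(cue in lowered for cue in PAST_CUES) else "current"
    # Stage 4: regular expressions pull out attribute mentions.
    amount = AMOUNT_RE.search(sentence)
    freq = FREQ_RE.search(sentence)
    return {
        "type": substance,
        "status": status,
        "amount": amount.group(0) if amount else None,
        "frequency": freq.group(0) if freq else None,
    }
```

For example, `extract("Patient denies tobacco or alcohol use.")` yields a negated smoking record, while `extract("Vital signs stable.")` returns `None` because no keyword matches.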
Barry Smith an sich
Festschrift in Honor of Barry Smith on the occasion of his 65th Birthday. Published as issue 4:4 of the journal Cosmos + Taxis: Studies in Emergent Order and Organization. Includes contributions by Wolfgang Grassl, Nicola Guarino, John T. Kearns, Rudolf Lüthe, Luc Schneider, Peter Simons, Wojciech Żełaniec, and Jan Woleński
Spanish named entity recognition in the biomedical domain
Named Entity Recognition in the clinical domain and in languages other than English is hampered by the absence of complete dictionaries, the informality of texts, the polysemy of terms, disagreement about the boundaries of an entity, and the scarcity of corpora and other available resources. We present a Named Entity Recognition method for poorly resourced languages. The method was tested with Spanish radiology reports and compared with a conditional random fields system. Peer Reviewed. Postprint (author's final draft)
Private hospital workflow optimization via secure k-means clustering
Optimizing the workflow of a complex organization such as a hospital is a difficult task. An accurate option is to use a real-time locating system to track the locations of both patients and staff. However, privacy regulations forbid hospital management from assessing the location data of their staff members. In this exploratory work, we propose a secure solution to analyze the joined location data of patients and staff by means of an innovative cryptographic technique called Secure Multi-Party Computation, in which an additional entity that the staff members can trust, such as a labour union, takes care of the staff data. The hospital, which owns the location data of patients, and the labour union perform a two-party protocol in which they securely cluster the staff members by the frequency of their patient-facing times. We describe the secure solution in detail and evaluate the performance of our proof-of-concept. This work thus demonstrates the feasibility of secure multi-party clustering in this setting
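As a rough illustration of the kind of building block that secure multi-party computation rests on, the sketch below shows two-party additive secret sharing: each value is split into two random-looking shares, the parties add shares locally, and only the sum is ever reconstructed. This is not the protocol from the paper; the modulus, function names, and example values are assumptions chosen for illustration.

```python
import secrets

MODULUS = 2**61 - 1  # assumed modulus, chosen only for this illustration


def share(value, modulus=MODULUS):
    """Split a value into two additive shares that sum to it modulo `modulus`."""
    first = secrets.randbelow(modulus)
    return first, (value - first) % modulus


def add_shares(a, b, modulus=MODULUS):
    """Add two shared values share-wise, without reconstructing either one."""
    return tuple((x + y) % modulus for x, y in zip(a, b))


def reconstruct(shares, modulus=MODULUS):
    """Combine both shares to reveal the shared value."""
    return sum(shares) % modulus


# Hypothetical patient-facing counts for three staff members.
counts = [3, 7, 12]
shared_total = (0, 0)
for c in counts:
    shared_total = add_shares(shared_total, share(c))
total = reconstruct(shared_total)  # only the aggregate is revealed
```

An aggregate such as this sum is the sort of quantity a k-means centroid update needs, which is why share-wise addition is a natural primitive for clustering without exposing individual records.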