Discovering body site and severity modifiers in clinical texts
Objective: To research computational methods for discovering body site and severity modifiers in clinical texts. Methods: We cast the task of discovering body site and severity modifiers as a relation extraction problem in the context of a supervised machine learning framework. We utilize rich linguistic features to represent the pairs of relation arguments and delegate the decision about the nature of the relationship between them to a support vector machine model. We evaluate our models using two corpora that annotate body site and severity modifiers. We also compare the model performance to a number of rule-based baselines. We conduct cross-domain portability experiments. In addition, we carry out feature ablation experiments to determine the contribution of various feature groups. Finally, we perform error analysis and report the sources of errors. Results: Our method for discovering body site modifiers achieves F1 of 0.740–0.908, and our method for discovering severity modifiers achieves F1 of 0.905–0.929. Discussion: Results indicate that both methods perform well on both in-domain and out-of-domain data, approaching the performance of human annotators. The most salient features are token and named entity features, although syntactic dependency features also contribute to the overall performance. The dominant sources of errors are infrequent patterns in the data and the inability of the system to discern deeper semantic structures. Conclusions: We investigated computational methods for discovering body site and severity modifiers in clinical texts. Our best system is released open source as part of the clinical Text Analysis and Knowledge Extraction System (cTAKES).
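As a rough illustration of the setup described above, the sketch below casts each candidate (disorder mention, modifier mention) pair as a feature dictionary fed to a linear SVM; the feature names and toy pairs are invented for illustration, not the paper's actual feature set.

```python
# Hedged sketch: modifier discovery as pairwise relation classification
# with an SVM. Feature names and examples are illustrative only.
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Each candidate (disorder, modifier) pair becomes one feature dictionary;
# the label records whether the modifier truly attaches to the disorder.
pairs = [
    {"arg1_text": "ulcer", "arg2_text": "stomach", "tokens_between": 1,
     "arg2_entity_type": "AnatomicalSite", "dep_path": "nmod"},
    {"arg1_text": "ulcer", "arg2_text": "aspirin", "tokens_between": 5,
     "arg2_entity_type": "Drug", "dep_path": "obl"},
]
labels = [1, 0]  # 1 = arg2 modifies arg1, 0 = unrelated

model = make_pipeline(DictVectorizer(), LinearSVC())
model.fit(pairs, labels)
print(model.predict(pairs))
```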
Evaluation of ChatGPT Family of Models for Biomedical Reasoning and Classification
Recent advances in large language models (LLMs) have shown impressive ability in biomedical question-answering, but LLMs have not been adequately investigated for more specific biomedical applications. This study investigates the performance of LLMs such as the ChatGPT family of models (GPT-3.5s, GPT-4) in biomedical tasks beyond question-answering. Because no patient data can be passed to the OpenAI API public interface, we evaluated model performance on over 10,000 samples as proxies for two fundamental tasks in the clinical domain: classification and reasoning. The first task is classifying whether statements of clinical and policy recommendations in scientific literature constitute health advice. The second task is causal relation detection from the biomedical literature. We compared the LLMs with simpler models, such as bag-of-words (BoW) with logistic regression, and with fine-tuned BioBERT models. Despite the excitement around viral ChatGPT, we found that fine-tuning remained the best strategy for these two fundamental NLP tasks. The simple BoW model performed on par with the most complex LLM prompting, and prompt engineering required significant investment.
Comment: 28 pages, 2 tables, and 4 figures. Submitted for review.
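For context, the kind of BoW-plus-logistic-regression baseline the study compares against fits in a few lines of scikit-learn; the toy sentences below are stand-ins, not the study's health-advice data.

```python
# Hedged sketch of a bag-of-words + logistic regression baseline for the
# health-advice task (toy sentences, not the study's data).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "Clinicians should recommend smoking cessation to all patients.",
    "The prevalence of diabetes was 8.5% in the study cohort.",
]
labels = [1, 0]  # 1 = health advice, 0 = not advice

baseline = make_pipeline(CountVectorizer(ngram_range=(1, 2)),
                         LogisticRegression(max_iter=1000))
baseline.fit(texts, labels)
print(baseline.predict(["Patients should receive annual screening."]))
```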
Natural language processing to automatically extract the presence and severity of esophagitis in notes of patients undergoing radiotherapy
Radiotherapy (RT) toxicities can impair survival and quality of life, yet remain under-studied. Real-world evidence holds potential to improve our understanding of toxicities, but toxicity information is often found only in clinical notes. We developed natural language processing (NLP) models to identify the presence and severity of esophagitis from the notes of patients treated with thoracic RT. We fine-tuned statistical and pre-trained BERT-based models for three esophagitis classification tasks: Task 1, presence of esophagitis; Task 2, severe esophagitis or not; and Task 3, no esophagitis vs. grade 1 vs. grade 2-3. Transferability was tested on 345 notes from patients with esophageal cancer undergoing RT.

Fine-tuning PubMedBERT yielded the best performance. The best macro-F1 was 0.92, 0.82, and 0.74 for Tasks 1, 2, and 3, respectively. Selecting the most informative note sections during fine-tuning improved macro-F1 by over 2% for all tasks, and silver-labeled data improved macro-F1 by over 3% across all tasks. For the esophageal cancer notes, the best macro-F1 was 0.73, 0.74, and 0.65 for Tasks 1, 2, and 3, respectively, without additional fine-tuning.

To our knowledge, this is the first effort to automatically extract esophagitis toxicity severity according to CTCAE guidelines from clinic notes. The promising performance provides proof of concept for NLP-based automated detailed toxicity monitoring in expanded domains.
Comment: 17 pages, 6 tables, 1 figure. Submitting to JCO-CCI for review.
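A minimal sketch of the fine-tuning setup for Task 1 follows; the checkpoint name is one publicly released PubMedBERT variant and the example note is invented, so both are assumptions rather than the paper's exact configuration.

```python
# Hedged sketch: a BERT-style classifier for Task 1 (esophagitis present
# vs. absent). Checkpoint name and note text are assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=2)  # 2 labels for Task 1; 3 would fit Task 3

note = "Patient reports odynophagia consistent with grade 2 esophagitis."
inputs = tokenizer(note, truncation=True, max_length=512,
                   return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))  # class probabilities (head still untrained)
```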
Towards comprehensive syntactic and semantic annotations of the clinical narrative
Objective: To create annotated clinical narratives with layers of syntactic and semantic labels to facilitate advances in clinical natural language processing (NLP), and to develop NLP algorithms and open source components. Methods: Manual annotation of a clinical narrative corpus of 127,606 tokens following the Treebank schema for syntactic information, the PropBank schema for predicate-argument structures, and the Unified Medical Language System (UMLS) schema for semantic information. NLP components were developed. Results: The final corpus consists of 13,091 sentences containing 1,772 distinct predicate lemmas. Of the 766 newly created PropBank frames, 74 are verbs. There are 28,539 named entity (NE) annotations spread over 15 UMLS semantic groups, one UMLS semantic type, and the Person semantic category. The most frequent annotations belong to the UMLS semantic groups of Procedures (15.71%), Disorders (14.74%), Concepts and Ideas (15.10%), Anatomy (12.80%), Chemicals and Drugs (7.49%), and the UMLS semantic type of Sign or Symptom (12.46%). Inter-annotator agreement results: Treebank (0.926), PropBank (0.891–0.931), NE (0.697–0.750). The part-of-speech tagger, constituency parser, dependency parser, and semantic role labeler are built from the corpus and released open source. A significant limitation uncovered by this project is the need for the NLP community to develop a widely agreed-upon schema for the annotation of clinical concepts and their relations. Conclusions: This project takes a foundational step towards bringing the field of clinical NLP up to par with NLP in the general domain. The corpus creation and NLP components provide a resource for research and application development that would previously have been impossible.
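For token-level labels like the NE annotations above, one common way to quantify inter-annotator agreement is Cohen's kappa, sketched on toy annotations below; the paper's exact agreement metrics may differ.

```python
# Hedged sketch: inter-annotator agreement via Cohen's kappa on toy
# token-level labels (not the corpus's actual annotations or metric).
from sklearn.metrics import cohen_kappa_score

annotator_a = ["Disorder", "O", "Procedure", "Anatomy", "O"]
annotator_b = ["Disorder", "O", "Procedure", "O", "O"]
print(cohen_kappa_score(annotator_a, annotator_b))
```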
Automatic Prediction of Rheumatoid Arthritis Disease Activity from the Electronic Medical Records
Objective: We aimed to mine the data in the Electronic Medical Record to automatically discover patients' Rheumatoid Arthritis disease activity at discrete rheumatology clinic visits. We cast the problem as a document classification task where the feature space includes concepts from the clinical narrative and lab values as stored in the Electronic Medical Record. Materials and Methods: The Training Set consisted of 2792 clinical notes and associated lab values. Test Set 1 included 1749 clinical notes and associated lab values. Test Set 2 included 344 clinical notes for which there were no associated lab values. The Apache clinical Text Analysis and Knowledge Extraction System was used to analyze the text and transform it into informative features to be combined with relevant lab values. Results: Experiments over a range of machine learning algorithms and features were conducted. The best performing combination was linear kernel Support Vector Machines with Unified Medical Language System Concept Unique Identifier features with feature selection and lab values. The Area Under the Receiver Operating Characteristic Curve (AUC) is 0.831 (σ = 0.0317), statistically significant as compared to two baselines (AUC = 0.758, σ = 0.0291). Algorithms demonstrated superior performance on cases clinically defined as extreme categories of disease activity (Remission and High) compared to those defined as intermediate categories (Moderate and Low) and included laboratory data on inflammatory markers. Conclusion: Automatic Rheumatoid Arthritis disease activity discovery from Electronic Medical Record data is a learnable task approximating human performance. As a result, this approach might have several research applications, such as the identification of patients for genome-wide pharmacogenetic studies that require large sample sizes with precise definitions of disease activity and response to therapies.
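A minimal sketch of how narrative-derived concept features and lab values can share one feature space for a linear-kernel SVM, as in the setup above; the CUIs, lab names, and values are invented for illustration.

```python
# Hedged sketch: UMLS CUI counts from the narrative combined with lab
# values in one feature space for a linear-kernel SVM. All feature names
# and values below are invented for illustration.
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

visits = [
    {"cui_C0003873": 3, "cui_C0030193": 1, "lab_crp": 22.0, "lab_esr": 40.0},
    {"cui_C0003873": 1, "lab_crp": 2.0, "lab_esr": 8.0},
]
labels = [1, 0]  # 1 = high disease activity, 0 = remission

clf = make_pipeline(DictVectorizer(), SVC(kernel="linear"))
clf.fit(visits, labels)
print(clf.decision_function(visits))  # scores usable for ROC/AUC
```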
Large Language Models to Identify Social Determinants of Health in Electronic Health Records
Social determinants of health (SDoH) have an important impact on patient outcomes but are incompletely collected in the electronic health record (EHR). This study investigated the ability of large language models to extract SDoH from free text in EHRs, where they are most commonly documented, and explored the role of synthetic clinical text in improving the extraction of these scarcely documented yet extremely valuable clinical data. 800 patient notes were annotated for SDoH categories, and several transformer-based models were evaluated. The study also experimented with synthetic data generation and assessed for algorithmic bias. Our best-performing models were fine-tuned Flan-T5 XL (macro-F1 0.71) for any SDoH and Flan-T5 XXL (macro-F1 0.70). The benefit of augmenting fine-tuning with synthetic data varied across model architecture and size, with smaller Flan-T5 models (base and large) showing the greatest improvements in performance (delta F1 +0.12 to +0.23). Model performance was similar on the in-hospital system dataset but worse on the MIMIC-III dataset. Our best-performing fine-tuned models outperformed the zero- and few-shot performance of ChatGPT-family models on both tasks, and were less likely than ChatGPT to change their prediction when race/ethnicity and gender descriptors were added to the text, suggesting less algorithmic bias (p < 0.05). At the patient level, our models identified 93.8% of patients with adverse SDoH, while ICD-10 codes captured 2.0%. Our method can effectively extract SDoH information from clinic notes, performing better than GPT in zero- and few-shot settings. These models could enhance real-world evidence on SDoH and aid in identifying patients needing social support.
Comment: 38 pages, 5 figures, 5 tables in main. Submitted for review.
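A minimal sketch of prompting a Flan-T5 model for SDoH categorization follows; it uses the smaller public flan-t5-base checkpoint, whereas the study fine-tuned the XL/XXL variants, and the prompt wording and label options are assumptions.

```python
# Hedged sketch: zero-shot SDoH tagging with a small public Flan-T5.
# Prompt wording and label set are assumptions, not the study's setup.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

sentence = "Patient lives alone and has no transportation to appointments."
prompt = ("Label the social determinants of health in this sentence "
          "(options: housing, transportation, social support, none): "
          + sentence)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```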
Identification of subjects with polycystic ovary syndrome using electronic health records
Background: Polycystic ovary syndrome (PCOS) is a heterogeneous disorder because of the variable criteria used for diagnosis. Therefore, International Classification of Diseases 9 (ICD-9) codes may not accurately capture the diagnostic criteria necessary for large-scale PCOS identification. We hypothesized that use of electronic medical record text and data would more specifically capture PCOS subjects. Methods: Subjects with PCOS were identified in the Partners Healthcare Research Patients Data Registry by searching for the term “polycystic ovary syndrome” using natural language processing (n = 24,930). A training subset of 199 identified charts was reviewed and categorized based on likelihood of a true Rotterdam PCOS diagnosis, i.e. two out of three of the following: irregular menstrual cycles, hyperandrogenism, and/or polycystic ovary morphology. Data from the history, physical exam, laboratory, and radiology results were codified and extracted from notes of definite PCOS subjects. Thirty-two terms were used to build an algorithm for identifying definite PCOS cases, which was then applied to the rest of the dataset. The positive predictive value cutoff was set at 76.8% to maximize the number of subjects available for study. A true positive predictive value for the algorithm was calculated after review of 100 charts from subjects identified as definite PCOS cases with at least two documented Rotterdam criteria. The positive predictive value was compared to that calculated using 200 charts identified using the ICD-9 code for PCOS (256.4; n = 13,670). In addition, a cohort of previously recruited PCOS subjects was used for algorithm validation. Results: Chart review demonstrated that 64% of subjects were confirmed as definitely PCOS using the algorithm, with a 9% false positive rate. 66% of subjects identified by the ICD-9 code for PCOS could be confirmed as definitely PCOS, with an 8.5% false positive rate. There was no significant difference in the positive predictive values using the two methods (p = 0.2). However, the number of charts that had insufficient confirmatory data was lower using the algorithm (5% vs. 11%; p < 0.04). Of 477 subjects with PCOS recruited and examined individually and present in the database as patients, 451 were found within the algorithm dataset. Conclusions: Extraction of text parameters along with codified data improves confidence in PCOS patient cohorts identified using the electronic medical record. However, the positive predictive value was not significantly different when using ICD-9 codes or the specific algorithm. Further studies are needed to determine the positive predictive value of the two methods in additional electronic medical record datasets.
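The chart-review arithmetic behind PPV estimates like those above can be made concrete; the counts below only approximate the reported percentages, so the resulting p-value is illustrative and will not reproduce the paper's exact test.

```python
# Hedged sketch: PPV from a chart-review sample plus a two-proportion
# z-test. Counts only approximate the reported percentages.
from statistics import NormalDist

def ppv(confirmed, reviewed):
    """Positive predictive value estimated from reviewed charts."""
    return confirmed / reviewed

alg = ppv(64, 100)    # algorithm: 64 of 100 reviewed charts confirmed
icd = ppv(132, 200)   # ICD-9 code: 66% of 200 reviewed charts confirmed

# Pooled two-proportion z-test for the PPV difference.
p_pool = (64 + 132) / (100 + 200)
se = (p_pool * (1 - p_pool) * (1 / 100 + 1 / 200)) ** 0.5
z = (alg - icd) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print(alg, icd, round(p_value, 2))
```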
The impact of responding to patient messages with large language model assistance
Documentation burden is a major contributor to clinician burnout, which is rising nationally and is an urgent threat to our ability to care for patients. Artificial intelligence (AI) chatbots, such as ChatGPT, could reduce clinician burden by assisting with documentation. Although many hospitals are actively integrating such systems into electronic medical record systems, the utility of AI chatbots and their impact on clinical decision-making have not been studied for this intended use. We are the first to examine the utility of large language models in assisting clinicians to draft responses to patient questions. In our two-stage cross-sectional study, 6 oncologists responded to 100 realistic synthetic cancer patient scenarios and portal messages developed to reflect common medical situations, first manually and then with AI assistance.

We find that AI-assisted responses were longer and less readable, but provided acceptable drafts without edits 58% of the time. AI assistance improved efficiency 77% of the time, with low harm risk (82% safe); however, 7.7% of unedited AI responses could cause severe harm. In 31% of cases, physicians thought the AI drafts were human-written. AI assistance led to more patient-education recommendations and fewer clinical actions than manual responses. Results show promise for AI to improve clinician efficiency and patient care by assisting with documentation, if used judiciously. Monitoring model outputs and human-AI interaction remains crucial for safe implementation.
Comment: 4 figures and tables in main. Submitted for review.
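A minimal sketch of the kind of length-and-readability comparison behind the "longer, less readable" finding, using the third-party textstat package as an assumed helper; the study's exact metrics may differ.

```python
# Hedged sketch: comparing draft length and readability. The drafts are
# invented, and textstat is an assumed helper, not the study's tooling.
import textstat

manual = "Your scan looks stable. Please keep your appointment next week."
assisted = ("Thank you for reaching out. Your recent imaging demonstrates "
            "stable findings without evidence of progression, and we "
            "recommend continuing the current surveillance schedule.")

for name, draft in [("manual", manual), ("AI-assisted", assisted)]:
    words = len(draft.split())
    print(name, words, textstat.flesch_reading_ease(draft))
```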
Methods to Develop an Electronic Medical Record Phenotype Algorithm to Compare the Risk of Coronary Artery Disease across 3 Chronic Disease Cohorts
Background
Typically, algorithms to classify phenotypes using electronic medical record (EMR) data have been developed to perform well in a specific patient population. There is increasing interest in analyses that allow the study of a specific outcome across different diseases. Such a study in the EMR would require an algorithm that can be applied across different patient populations. Our objectives were: (1) to develop an algorithm that would enable the study of coronary artery disease (CAD) across diverse patient populations; and (2) to study the impact of adding narrative data extracted using natural language processing (NLP) to the algorithm. Additionally, we demonstrate how to implement the CAD algorithm to compare risk across 3 chronic diseases in a preliminary study.
Methods and Results
We studied 3 established EMR-based patient cohorts: diabetes mellitus (DM, n = 65,099), inflammatory bowel disease (IBD, n = 10,974), and rheumatoid arthritis (RA, n = 4,453) from two large academic centers. We developed a CAD algorithm using NLP in addition to structured data (e.g., ICD-9 codes) in the RA cohort and validated it in the DM and IBD cohorts. The CAD algorithm using NLP in addition to structured data achieved specificity >95% with a positive predictive value (PPV) of 90% in the training (RA) and validation (IBD and DM) sets. The addition of NLP data improved the sensitivity for all cohorts, classifying an additional 17% of CAD subjects in IBD and 10% in DM while maintaining a PPV of 90%. The algorithm classified 16,488 DM (26.1%), 457 IBD (4.2%), and 245 RA (5.0%) subjects as having CAD. In a cross-sectional analysis, CAD risk was 63% lower in RA and 68% lower in IBD compared to DM (p < 0.0001) after adjusting for traditional cardiovascular risk factors.
Conclusions
We developed and validated a CAD algorithm that performed well across diverse patient populations. The addition of NLP to the CAD algorithm improved the sensitivity of the algorithm, particularly in cohorts where the prevalence of CAD was low. Preliminary data suggest that CAD risk was significantly lower in RA and IBD compared to DM.
National Institutes of Health (U.S.), Informatics for Integrating Biology and the Bedside Project (U54LM008748).
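An illustrative sketch of a portable phenotype rule in the spirit of the algorithm above, combining structured codes with NLP-extracted mentions; the thresholds and feature names are invented, not the published algorithm's fitted parameters.

```python
# Hedged sketch of a phenotype rule mixing structured and narrative
# evidence. Thresholds and features are invented for illustration.
def classify_cad(icd9_cad_count: int, nlp_cad_mentions: int,
                 revascularization_code: bool) -> bool:
    """Flag CAD when either evidence source is strong on its own, or
    when structured and narrative evidence agree."""
    strong_structured = icd9_cad_count >= 2 or revascularization_code
    strong_narrative = nlp_cad_mentions >= 2
    agree = icd9_cad_count >= 1 and nlp_cad_mentions >= 1
    return strong_structured or strong_narrative or agree

print(classify_cad(icd9_cad_count=1, nlp_cad_mentions=1,
                   revascularization_code=False))  # True
```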