Safe adaptation of foundation large language models for clinical information extraction in aged care: Evaluating optimal methods

Abstract

Background: Despite the rapid digitisation of healthcare, extracting useful clinical information from unstructured electronic health records (EHRs) remains a major challenge because clinical data are inconsistent and ambiguous. Foundation large language models (LLMs) such as GPT-4 show promise in addressing this issue because of their ability to understand and process natural language. However, these models are usually trained on general internet data, which may not include specialised healthcare terminology, leading to potential misunderstandings in clinical contexts. Additionally, using foundation LLMs in healthcare raises safety concerns about patient data privacy and the risk of biased or incorrect recommendations. Effective adaptation methods such as prompt tuning, parameter-efficient fine-tuning (PEFT), and retrieval-augmented generation (RAG), along with rigorous evaluation, are therefore essential to safely integrate these models into clinical workflows. Even so, whether these models can be safely adapted to real-world clinical settings, such as residential aged care (RAC), remains uncertain because of several critical gaps: (1) limited knowledge about the performance of existing foundation LLMs for clinical information extraction (IE), (2) lack of understanding of optimal domain adaptation strategies for these models, (3) reliance on synthetic or publicly available datasets with narrow task scopes, which limits their ability to capture the complexity and variability of real-world clinical environments, and (4) concerns about the trustworthiness of healthcare AI due to the absence of a holistic evaluation of foundation LLM performance in clinical IE.

Aim: To address the above challenges, this study explores methods and approaches for the safe and effective adaptation of foundation LLMs to clinical IE tasks in residential aged care facilities (RACFs) in Australia.
This aim is achieved through three objectives: (1) identifying the most effective foundation LLM types for clinical IE, (2) identifying the optimal training methods for adapting foundation LLMs for clinical IE, and (3) developing a holistic evaluation framework for foundation LLM performance in clinical IE.

Methods: The following methods were systematically applied: (1) comparing various foundation LLM types, including encoder-based (e.g., BioBERT, BlueBERT, CancerBERT, DDS-BERT, RuBERT, LABSE, EhrBERT, MedBERT, ClinicalBERT, Clinical BioBERT, Discharge Summary BERT, and Discharge Summary BioBERT), decoder-based (e.g., GPT-2, GPT-3, GPT2-Bio-Pt), and generative models, covering both general-purpose (e.g., Llama, Mistral, Gemini) and health-specific LLMs (e.g., MedAlpaca, Baize-Healthcare, ChatDoctor, Asclepius, PMC Llama, Me-Llama); (2) comparing different foundation LLM training approaches, including prompt tuning, PEFT, and RAG; (3) developing a comprehensive evaluation framework that assesses multiple dimensions beyond standard metrics (e.g., accuracy, F1 score), incorporating metrics such as robustness, fairness, bias, and relevance; and (4) evaluating foundation LLMs across various clinical IE tasks (e.g., named entity recognition (NER), summarisation) and clinical contexts (e.g., agitation in dementia, malnutrition) using free-text nursing notes from Australian RACFs.

Results: Our findings demonstrate that (1) among encoder-based and decoder-based models developed between 2018 and 2022, encoder-based LLMs, particularly Clinical BioBERT, achieved slightly higher performance across clinical IE tasks, with an accuracy of 92% and an F1 score of 90% for NER, an accuracy of 94% and an F1 score of 94% for text classification, and an accuracy of 91% and an F1 score of 92% for question answering.
(2) Among generative LLMs, health-specific models demonstrated slightly better performance than general-purpose models, with Me-Llama emerging as the top performer in NER (accuracy: 75.18%, F1 score: 75.00%, robustness: 6.00%) and summarisation (accuracy: 79.01%, F1 score: 79.24%, robustness: 4.90%, relevance: 78.25%). (3) Fine-tuning an encoder-based LLM such as Clinical BioBERT with deep learning components, such as bidirectional long short-term memory (BiLSTM) and conditional random field (CRF) layers, outperformed the standalone Clinical BioBERT, achieving an F1 score of 75% and an accuracy of 78% in NER. (4) PEFT with zero-shot learning and RAG with few-shot learning, when applied to the Llama 3.1-8B Instruct model, achieved comparable performance: PEFT with zero-shot learning attained 89% accuracy, 88% precision, 88% recall, and an F1 score of 90%, while RAG with few-shot learning achieved 88% accuracy, 85% precision, 86% recall, and an F1 score of 87%. (5) Among RAG-integrated frameworks, the combination of LangChain and LlamaIndex, when applied to the Llama model, yielded better results than either framework used individually. (6) Among the prompting strategies applied to the Llama model with RAG (one-shot, three-shot, four-shot, and five-shot), three-shot prompting provided optimal results for NER, while five-shot prompting achieved the best results for summarisation. (7) A holistic evaluation framework was developed that includes accuracy, F1 score, robustness, fairness, bias, and relevance. Accuracy captures the correctness of the information generated by the model, while the F1 score balances precision and recall. Robustness assesses the model's ability to handle real-world variations and unexpected inputs, supporting reliability in diverse scenarios. Fairness and bias metrics evaluate whether the model treats diverse patient groups and temporal cohorts equitably, guarding against discriminatory outcomes.
Finally, relevance measures how well the model’s outputs capture key and pertinent information from the input data. Together, these metrics create a comprehensive safety net, ensuring the model performs reliably, ethically, and effectively in clinical environments. This holistic evaluation framework was applied to Llama 3.1, which achieved 88.58% accuracy and an 87.43% F1 score in NER. In summarisation, it attained an 88.18% F1 score and 83.15% relevance. However, robustness remained low (4.00% for NER, 4.31% for summarisation), despite excellent fairness (99.9%) and minimal bias (0.11%) in both tasks.

Conclusion: This study's contributions include (1) identifying the most effective foundation LLM types, (2) determining optimal training strategies, and (3) developing a holistic evaluation framework that assesses accuracy, F1 score, robustness, fairness, bias, and relevance. These findings ensure the safe integration of foundation LLMs in real-world RAC settings and pave the way for advanced clinical IE tools, such as NER and summarisation models, which can support clinical decision-making, enhance care quality, and ultimately lead to better outcomes for residents. Beyond RACFs, the insights from this study highlight the broader potential of LLMs in clinical IE across various healthcare settings. Future research can build on these insights to enhance foundation LLMs and their training and evaluation methods, ensuring their safe, effective, and ethical use in clinical information retrieval, thereby improving clinical research and practice.
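The precision, recall, and F1 figures reported above can be illustrated with a minimal sketch. It assumes entity-level exact matching for NER, where an entity counts as correct only if both its text and its label match; the abstract does not state which matching scheme the study used. The function name, entity labels, and example entities are hypothetical and not drawn from the study data.

```python
def precision_recall_f1(gold, predicted):
    """Entity-level exact-match scores for gold vs. predicted entity sets.

    Assumption: each entity is a (text, label) tuple and must match exactly.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)

    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0

    # F1 is the harmonic mean of precision and recall.
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical annotations for one nursing note (illustrative only).
gold_entities = {
    ("agitation", "SYMPTOM"),
    ("weight loss", "SYMPTOM"),
    ("risperidone", "MEDICATION"),
}
predicted_entities = {
    ("agitation", "SYMPTOM"),
    ("risperidone", "MEDICATION"),
    ("falls", "EVENT"),
}

precision, recall, f1 = precision_recall_f1(gold_entities, predicted_entities)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because F1 is a harmonic mean, it always lies between precision and recall, which makes it a useful single summary when the two diverge.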
