7 research outputs found

    Exploring the Value of Pre-trained Language Models for Clinical Named Entity Recognition

    Full text link
    The practice of fine-tuning Pre-trained Language Models (PLMs), trained on general or domain-specific data, to a specific task with limited resources has gained popularity within the field of natural language processing (NLP). In this work, we revisit this assumption and carry out an investigation in clinical NLP, specifically Named Entity Recognition (NER) on drugs and their related attributes. We compare Transformer models that are trained from scratch to fine-tuned BERT-based LLMs, namely BERT, BioBERT, and ClinicalBERT. Furthermore, we examine the impact of an additional Conditional Random Field (CRF) layer on such models to encourage contextual learning. We use the n2c2-2018 shared task data for model development and evaluation. The experimental outcomes show that 1) CRF layers improved all language models; 2) in BIO-strict span-level evaluation using macro-average F1 score, the fine-tuned LLMs achieved 0.83+ while the TransformerCRF model trained from scratch achieved 0.78+, demonstrating comparable performance at much lower cost, e.g. with 39.80% fewer training parameters; 3) in BIO-strict span-level evaluation using weighted-average F1 score, ClinicalBERT-CRF, BERT-CRF, and TransformerCRF exhibited smaller score differences, at 97.59%, 97.44%, and 96.84% respectively; 4) efficient training by down-sampling for a better data distribution further reduced the training cost and the need for data while maintaining similar scores, i.e. around 0.02 points lower than with the full dataset. Our models will be hosted at https://github.com/HECTA-UoM/TransformerCRF
    Comment: working paper - Large Language Models, Fine-tuning LLMs, Clinical NLP, Medication Mining, AI for Healthcare
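
    As a rough illustration of the Transformer-plus-CRF idea described in this abstract, the sketch below builds a toy token tagger in PyTorch with the third-party pytorch-crf package. The layer sizes, tag set, and class name TransformerCRFTagger are illustrative assumptions, not the authors' actual TransformerCRF configuration.

    import torch
    import torch.nn as nn
    from torchcrf import CRF  # third-party package: pip install pytorch-crf


    class TransformerCRFTagger(nn.Module):
        """Toy Transformer encoder with a CRF output layer for BIO tagging."""

        def __init__(self, vocab_size, num_tags, d_model=128, nhead=4, num_layers=2):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model, padding_idx=0)
            layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                               batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
            self.emissions = nn.Linear(d_model, num_tags)  # per-token tag scores
            self.crf = CRF(num_tags, batch_first=True)     # learns tag transitions

        def forward(self, token_ids, tags, mask):
            scores = self.emissions(self.encoder(self.embed(token_ids)))
            return -self.crf(scores, tags, mask=mask)      # negative log-likelihood

        def decode(self, token_ids, mask):
            scores = self.emissions(self.encoder(self.embed(token_ids)))
            return self.crf.decode(scores, mask=mask)      # best BIO tag sequence


    # Tiny usage example with random data (5 tags, e.g. O, B-Drug, I-Drug, B-Dose, I-Dose).
    model = TransformerCRFTagger(vocab_size=1000, num_tags=5)
    ids = torch.randint(1, 1000, (2, 10))
    tags = torch.randint(0, 5, (2, 10))
    mask = torch.ones(2, 10, dtype=torch.bool)
    loss = model(ids, tags, mask)
    loss.backward()
    print(model.decode(ids, mask))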

    Generating Medical Prescriptions with Conditional Transformer

    Full text link
    Access to real-world medication prescriptions is essential for medical research and healthcare quality improvement. However, such access is often limited due to the sensitive nature of the information they contain. Additionally, manually labelling these instructions for training and fine-tuning Natural Language Processing (NLP) models can be tedious and expensive. We introduce a novel task-specific model architecture, Label-To-Text-Transformer (LT3), tailored to generate synthetic medication prescriptions from provided labels, such as a vocabulary list of medications and their attributes. LT3 is trained on around 2K lines of medication prescriptions extracted from the MIMIC-III database, allowing the model to produce valuable synthetic medication prescriptions. We evaluate LT3's performance by contrasting it with a state-of-the-art Pre-trained Language Model (PLM), T5, analysing the quality and diversity of the generated texts. We deploy the generated synthetic data to train the SpacyNER model for the Named Entity Recognition (NER) task over the n2c2-2018 dataset. The experiments show that the model trained on synthetic data can achieve a 96-98% F1 score at Label Recognition on Drug, Frequency, Route, Strength, and Form. LT3 code and data will be shared at https://github.com/HECTA-UoM/Label-To-Text-Transformer
    Comment: Accepted to: Workshop on Synthetic Data Generation with Generative AI (SyntheticData4ML Workshop) at NeurIPS 202
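
    To make the label-to-text interface concrete, here is a minimal sketch using an off-the-shelf T5 checkpoint from Hugging Face transformers as a stand-in for LT3. The checkpoint, prompt format, and generation settings are assumptions for illustration only; in practice a model would first be fine-tuned on label/prescription pairs, as LT3 was.

    # Stand-in for LT3: seq2seq generation from a structured label string.
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained("t5-base")
    model = T5ForConditionalGeneration.from_pretrained("t5-base")

    # Structured label input: a drug name plus attribute slots to verbalise.
    labels = "drug: metformin | strength: 500 mg | route: oral | frequency: twice daily"
    inputs = tokenizer(labels, return_tensors="pt")

    # Sample a few candidate prescription lines; without task fine-tuning the
    # outputs of vanilla t5-base will be rough, this only shows the interface.
    outputs = model.generate(**inputs, max_new_tokens=40, num_return_sequences=3,
                             do_sample=True, top_k=50)
    for seq in outputs:
        print(tokenizer.decode(seq, skip_special_tokens=True))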

    Large Language Models and Control Mechanisms Improve Text Readability of Biomedical Abstracts

    Full text link
    Biomedical literature often uses complex language and inaccessible professional terminologies, which is why simplification plays an important role in improving public health literacy. Applying Natural Language Processing (NLP) models to automate such tasks allows for quick and direct accessibility for lay readers. In this work, we investigate the performance of state-of-the-art large language models (LLMs) on the task of biomedical abstract simplification, using the publicly available dataset for plain language adaptation of biomedical abstracts (PLABA). The methods applied include domain fine-tuning and prompt-based learning (PBL) on: 1) Encoder-decoder models (T5, SciFive, and BART), 2) Decoder-only GPT models (GPT-3.5 and GPT-4) from OpenAI and BioGPT, and 3) Control-token mechanisms on BART-based models. We used a range of automatic evaluation metrics, including BLEU, ROUGE, SARI, and BERTscore, and also conducted human evaluations. BART-Large with Control Token (BART-L-w-CT) mechanisms reported the highest SARI score of 46.54, and T5-Base reported the highest BERTscore of 72.62. In human evaluation, BART-L-w-CTs achieved a better simplicity score than T5-Base (2.9 vs. 2.2), while T5-Base achieved a better meaning-preservation score than BART-L-w-CTs (3.1 vs. 2.6). We also categorised the system outputs with examples, hoping this will shed some light on future research on this task. Our code, fine-tuned models, and data splits are available at https://github.com/HECTA-UoM/PLABA-MU
    Comment: working paper
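
    The automatic metrics mentioned above (SARI, BERTscore) can be computed with the Hugging Face evaluate library; a small sketch follows. The example sentences are invented, not PLABA data, and the library choice is an assumption rather than the paper's stated tooling.

    import evaluate  # pip install evaluate sacrebleu bert_score

    sari = evaluate.load("sari")
    bertscore = evaluate.load("bertscore")

    sources = ["Myocardial infarction is precipitated by coronary artery occlusion."]
    predictions = ["A heart attack happens when an artery to the heart gets blocked."]
    references = [["A heart attack is caused by a blocked artery in the heart."]]

    sari_score = sari.compute(sources=sources, predictions=predictions,
                              references=references)
    bert_score = bertscore.compute(predictions=predictions,
                                   references=[r[0] for r in references], lang="en")

    print(sari_score["sari"])   # higher means better simplification quality
    print(bert_score["f1"][0])  # semantic similarity to the reference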

    Opportunities and Challenges for Molecular Understanding of Ciliopathies-The 100,000 Genomes Project.

    Get PDF
    Cilia are highly specialized cellular organelles that serve multiple functions in human development and health. Their central importance in the body is demonstrated by the occurrence of a diverse range of developmental disorders that arise from defects of cilia structure and function, caused by inherited mutations in more than 150 different genes. Genetic analysis has rapidly advanced our understanding of the cell biological basis of ciliopathies over the past two decades, with more recent technological advances in genomics rapidly accelerating this progress. The 100,000 Genomes Project was launched in 2012 in the UK to improve diagnosis and future care for individuals affected by rare diseases such as ciliopathies through whole genome sequencing (WGS). In this review, we discuss the potential promise and medical impact of WGS for ciliopathies and report on current progress of the 100,000 Genomes Project, reviewing the medical, technical, and ethical challenges and opportunities that new, large-scale initiatives such as this can offer.

    Exploring the Value of Pre-trained Language Models for Clinical Named Entity Recognition

    No full text
    The practice of fine-tuning Pre-trained Language Models (PLMs), trained on general or domain-specific data, to a specific task with limited resources has gained popularity within the field of natural language processing (NLP). In this work, we revisit this assumption and carry out an investigation in clinical NLP, specifically Named Entity Recognition (NER) on drugs and their related attributes. We compare Transformer models that are trained from scratch to fine-tuned BERT-based Large Language Models (LLMs), namely BERT, BioBERT, and ClinicalBERT. Furthermore, we examine the impact of an additional Conditional Random Field (CRF) layer on such models to encourage contextual learning. We use the n2c2-2018 shared task data for model development and evaluation. The experimental outcomes show that 1) CRF layers improved all language models; 2) in BIO-strict span-level evaluation using macro-average F1 score, the fine-tuned LLMs achieved 0.83+ while the TransformerCRF model trained from scratch achieved 0.78+, demonstrating comparable performance at much lower cost, e.g. with 39.80% fewer training parameters; 3) in BIO-strict span-level evaluation using weighted-average F1 score, ClinicalBERT-CRF, BERT-CRF, and TransformerCRF exhibited smaller score differences, at 97.59%, 97.44%, and 96.84% respectively; 4) efficient training by down-sampling for a better data distribution further reduced the training cost and the need for data while maintaining similar scores, i.e. around 0.02 points lower than with the full dataset. The TransformerCRF project is hosted at https://github.com/HECTA-UoM/TransformerCRF
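
    The BIO-strict span-level macro- and weighted-average F1 scores quoted above can be computed with the seqeval library; a minimal sketch with invented tag sequences is shown below (the tag names and sentences are illustrative, not n2c2-2018 data).

    # Span-level (BIO-strict) macro- and weighted-average F1 with seqeval.
    from seqeval.metrics import f1_score
    from seqeval.scheme import IOB2

    y_true = [["B-Drug", "O", "B-Strength", "I-Strength", "O", "B-Frequency"]]
    y_pred = [["B-Drug", "O", "B-Strength", "I-Strength", "O", "O"]]

    macro = f1_score(y_true, y_pred, average="macro", mode="strict", scheme=IOB2)
    weighted = f1_score(y_true, y_pred, average="weighted", mode="strict", scheme=IOB2)
    print(macro, weighted)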

    Development of simple and transferable molecular models for biodiesel production with the soft-SAFT equation of state

    No full text
    Knowledge of the thermodynamic properties of fatty acid esters/biodiesels is crucial not only for developing optimal biodiesel production and purification processes, but also for enhancing biodiesel performance in engines. This work applies a simple, reliable, and theoretically sound model, the soft-SAFT EoS, as a tool for the development, design, scale-up, and optimization of biodiesel production and purification processes. A molecular model within the soft-SAFT EoS framework is proposed for the fatty acid esters, and the Density Gradient Theory approach is coupled with soft-SAFT for the description of interfacial properties, while the Free-Volume Theory is used for the calculation of viscosities, in an integrated model. For pressures up to 150 MPa and in the temperature range 288.15-423.15 K, density, surface tension, viscosity, and speed of sound data for fatty acid methyl and ethyl esters, ranging from C-8:0 to C-24:0 with up to three unsaturated bonds, are described with deviations below 5%. Finally, to validate the predictive ability of the model for biodiesel applications, the high-pressure densities and viscosities of 8 biodiesels were predicted with the soft-SAFT EoS, reinforcing the validity of the approach for obtaining reliable predictions for engineering purposes. (C) 2014 The Institution of Chemical Engineers. Published by Elsevier B.V. All rights reserved.
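
    For orientation, the soft-SAFT framework writes the residual Helmholtz energy as a sum of reference, chain, and association contributions, and Density Gradient Theory links the bulk model to surface tension. The compact forms below are the standard textbook decomposition, written here as a hedged sketch rather than the exact working equations of this paper.

    % Residual Helmholtz energy in soft-SAFT: Lennard-Jones reference segments,
    % chain formation (Wertheim), and association contributions.
    A^{\mathrm{res}} = A^{\mathrm{LJ}} + A^{\mathrm{chain}} + A^{\mathrm{assoc}}

    % Chain term for a pure fluid of m LJ segments, with g_{\mathrm{LJ}} the pair
    % correlation function of the reference fluid at segment contact:
    \frac{A^{\mathrm{chain}}}{N k_B T} = (1 - m)\,\ln g_{\mathrm{LJ}}(\sigma)

    % Density Gradient Theory (schematic): surface tension from the influence
    % parameter c and the excess grand potential density \Delta\omega(\rho):
    \gamma = \int_{\rho_v}^{\rho_l} \sqrt{2\,c\,\Delta\omega(\rho)}\;\mathrm{d}\rho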