7 research outputs found
Exploring the Value of Pre-trained Language Models for Clinical Named Entity Recognition
The practice of fine-tuning Pre-trained Language Models (PLMs) from general or domain-specific data to a specific task with limited resources has gained popularity within the field of natural language processing (NLP). In this work, we revisit this practice and carry out an investigation in clinical NLP, specifically Named Entity Recognition (NER) on drugs and their related attributes. We compare Transformer models trained from scratch to fine-tuned BERT-based Large Language Models (LLMs), namely BERT, BioBERT, and ClinicalBERT. Furthermore, we examine the impact of an additional Conditional Random Field (CRF) layer on such models to encourage contextual learning. We use the n2c2-2018 shared task data for model development and evaluation. The experimental outcomes show that 1) CRF layers improved all language models; 2) in BIO-strict span-level evaluation with the macro-average F1 score, the fine-tuned LLMs achieved 0.83+ while the TransformerCRF model trained from scratch achieved 0.78+, a comparable performance at much lower cost, e.g. with 39.80\% fewer training parameters; 3) in BIO-strict span-level evaluation with the weighted-average F1 score, ClinicalBERT-CRF, BERT-CRF, and TransformerCRF showed only small score differences, at 97.59\%/97.44\%/96.84\% respectively; and 4) efficient training by down-sampling for a better data distribution further reduced the training cost and the need for data while maintaining similar scores, i.e. around 0.02 points lower than with the full dataset. Our models will be hosted at \url{https://github.com/HECTA-UoM/TransformerCRF}
Comment: working paper - Large Language Models, Fine-tuning LLMs, Clinical NLP, Medication Mining, AI for Healthcare
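The abstract does not include implementation details, so the following is only a minimal sketch of the general BERT-plus-CRF tagging architecture it describes, assuming the Hugging Face transformers and pytorch-crf packages; the model name and tag count are illustrative, not taken from the paper.

# Minimal sketch (not the paper's code): a BERT encoder with a CRF layer on top
# for BIO tagging. Assumes `transformers` and `pytorch-crf`; num_tags is illustrative.
import torch.nn as nn
from transformers import AutoModel
from torchcrf import CRF

class BertCrfTagger(nn.Module):
    def __init__(self, model_name="bert-base-cased", num_tags=19):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.emission = nn.Linear(self.encoder.config.hidden_size, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        emissions = self.emission(hidden)
        mask = attention_mask.bool()
        if tags is not None:
            # Training: negative log-likelihood of the gold tag sequence under the CRF.
            return -self.crf(emissions, tags, mask=mask, reduction="mean")
        # Inference: Viterbi decoding returns the most likely BIO tag sequence per input.
        return self.crf.decode(emissions, mask=mask)

A Transformer trained from scratch, as compared in the paper, would swap the pre-trained encoder for a randomly initialised one with the same interface.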
Generating Medical Prescriptions with Conditional Transformer
Access to real-world medication prescriptions is essential for medical
research and healthcare quality improvement. However, such access is often limited due to the sensitive nature of the information expressed. Additionally, manually labelling these instructions for training and
fine-tuning Natural Language Processing (NLP) models can be tedious and
expensive. We introduce a novel task-specific model architecture,
Label-To-Text-Transformer (\textbf{LT3}), tailored to generate synthetic
medication prescriptions based on provided labels, such as a vocabulary list of
medications and their attributes. LT3 is trained on a set of around 2K lines of
medication prescriptions extracted from the MIMIC-III database, allowing the
model to produce valuable synthetic medication prescriptions. We evaluate LT3's
performance by contrasting it with a state-of-the-art Pre-trained Language
Model (PLM), T5, analysing the quality and diversity of generated texts. We
deploy the generated synthetic data to train the SpacyNER model for the Named
Entity Recognition (NER) task over the n2c2-2018 dataset. The experiments show
that the model trained on synthetic data can achieve a 96-98\% F1 score at
Label Recognition on Drug, Frequency, Route, Strength, and Form. LT3 code and data will be shared at \url{https://github.com/HECTA-UoM/Label-To-Text-Transformer}
Comment: Accepted to: Workshop on Synthetic Data Generation with Generative AI (SyntheticData4ML Workshop) at NeurIPS 2023
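LT3's architecture itself is not detailed in this abstract; as a rough point of reference, the T5 baseline it is compared against can be run for label-to-text generation along the following lines, where the label string format and drug attributes shown are invented for illustration, and the model would first be fine-tuned on label/prescription pairs extracted from MIMIC-III.

# Illustrative baseline only (not LT3): conditional generation of a prescription
# string from a label list with T5, via the Hugging Face `transformers` API.
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")  # fine-tune before use

# Hypothetical input format: the medication labels to condition the generation on.
labels = "drug: metformin | strength: 500 mg | route: oral | frequency: twice daily"
inputs = tokenizer(labels, return_tensors="pt")

# Nucleus sampling encourages diversity in the generated synthetic prescriptions.
output_ids = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.9)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))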
Large Language Models and Control Mechanisms Improve Text Readability of Biomedical Abstracts
Biomedical literature often uses complex language and inaccessible
professional terminologies. That is why simplification plays an important role
in improving public health literacy. Applying Natural Language Processing (NLP)
models to automate such tasks allows for quick and direct accessibility for lay
readers. In this work, we investigate the ability of state-of-the-art large
language models (LLMs) on the task of biomedical abstract simplification, using
the publicly available dataset for plain language adaptation of biomedical
abstracts (\textbf{PLABA}). The methods applied include domain fine-tuning and
prompt-based learning (PBL) on: 1) Encoder-decoder models (T5, SciFive, and
BART), 2) Decoder-only GPT models (GPT-3.5 and GPT-4) from OpenAI and BioGPT,
and 3) Control-token mechanisms on BART-based models. We used a range of
automatic evaluation metrics, including BLEU, ROUGE, SARI, and BERTscore, and
also conducted human evaluations. BART-Large with Control Token (BART-L-w-CT)
mechanisms reported the highest SARI score of 46.54, and T5-Base reported the highest BERTscore of 72.62. In human evaluation, BART-L-w-CTs achieved a better
simplicity score over T5-Base (2.9 vs. 2.2), while T5-Base achieved a better
meaning preservation score over BART-L-w-CTs (3.1 vs. 2.6). We also categorised
the system outputs with examples, hoping this will shed some light on future research on this task. Our code, fine-tuned models, and data splits are available at \url{https://github.com/HECTA-UoM/PLABA-MU}
Comment: working paper
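As a concrete illustration of the automatic evaluation step, SARI and BERTScore can be computed with the Hugging Face evaluate library roughly as below; the source, prediction, and reference sentences are invented examples, not data from PLABA.

# Hedged sketch of the metric computation only: scoring one simplified sentence
# with SARI and BERTScore via the `evaluate` library.
import evaluate

sources = ["Myocardial infarction is characterised by ischaemic necrosis of cardiac tissue."]
predictions = ["A heart attack happens when heart tissue dies from lack of blood flow."]
references = [["A heart attack is when part of the heart muscle dies because it gets too little blood."]]

sari = evaluate.load("sari")
print(sari.compute(sources=sources, predictions=predictions, references=references))

bertscore = evaluate.load("bertscore")
print(bertscore.compute(predictions=predictions, references=[r[0] for r in references], lang="en"))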
Opportunities and Challenges for Molecular Understanding of Ciliopathies-The 100,000 Genomes Project.
Cilia are highly specialized cellular organelles that serve multiple functions in human development and health. Their central importance in the body is demonstrated by the occurrence of a diverse range of developmental disorders that arise from defects of cilia structure and function, caused by a range of different inherited mutations found in more than 150 different genes. Genetic analysis has rapidly advanced our understanding of the cell biological basis of ciliopathies over the past two decades, with more recent technological advances in genomics rapidly accelerating this progress. The 100,000 Genomes Project was launched in 2012 in the UK to improve diagnosis and future care for individuals affected by rare diseases like ciliopathies, through whole genome sequencing (WGS). In this review, we discuss the potential promise and medical impact of WGS for ciliopathies and report on current progress of the 100,000 Genomes Project, reviewing the medical, technical, and ethical challenges and opportunities that new, large-scale initiatives such as this can offer.
Development of simple and transferable molecular models for biodiesel production with the soft-SAFT equation of state
Knowledge of the thermodynamic properties of fatty acid esters/biodiesels is crucial not only for developing optimal biodiesel production and purification processes, but also for enhancing biodiesel performance in engines. This work applies a simple but reliable, theoretically sound model, the soft-SAFT EoS, as a tool for the development, design, scale-up, and optimization of biodiesel production and purification processes. A molecular model within the soft-SAFT EoS framework is proposed for the fatty acid esters; the Density Gradient Theory approach is coupled to soft-SAFT for the description of interfacial properties, while the Free-Volume Theory is used for the calculation of viscosities, in an integrated model. For pressures up to 150 MPa and temperatures from 288.15 to 423.15 K, density, surface tension, viscosity, and speed of sound data for fatty acid methyl and ethyl esters, ranging from C-8:0 to C-24:0 and with up to three unsaturated bonds, are described with deviations below 5%. Finally, to validate the predictive ability of the model for biodiesel applications, high-pressure densities and viscosities of 8 biodiesels were predicted with the soft-SAFT EoS, reinforcing the validity of the approach for obtaining reliable predictions for engineering purposes. (C) 2014 The Institution of Chemical Engineers. Published by Elsevier B.V. All rights reserved.
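For orientation only, the soft-SAFT EoS writes the residual Helmholtz energy of a chain fluid as the sum of a Lennard-Jones reference term, a chain term, and an association term; the reduced form below is the standard soft-SAFT decomposition, not the specific parameterisation fitted in this work.

\[
\frac{A^{\mathrm{res}}}{N k_{B} T} = \tilde{a}^{\mathrm{LJ}} + \tilde{a}^{\mathrm{chain}} + \tilde{a}^{\mathrm{assoc}},
\qquad
\tilde{a}^{\mathrm{chain}} = \sum_{i} x_{i}\,(1 - m_{i}) \ln g_{\mathrm{LJ}}(\sigma_{ii}),
\]

where $x_{i}$ is the mole fraction, $m_{i}$ the chain length, and $g_{\mathrm{LJ}}(\sigma_{ii})$ the pair correlation function of the Lennard-Jones reference fluid evaluated at the segment diameter. Interfacial tensions and viscosities then follow from coupling this free energy to the Density Gradient Theory and the Free-Volume Theory, respectively, as described in the abstract.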