7 research outputs found
Exploring the Value of Pre-trained Language Models for Clinical Named Entity Recognition
The practice of fine-tuning Pre-trained Language Models (PLMs) from general or domain-specific data to a specific task with limited resources has gained popularity within the field of natural language processing (NLP). In this work, we revisit this practice and carry out an investigation in clinical NLP, specifically Named Entity Recognition (NER) on drugs and their related attributes. We compare Transformer models trained from scratch to fine-tuned BERT-based Large Language Models (LLMs), namely BERT, BioBERT, and ClinicalBERT. Furthermore, we examine the impact of an additional Conditional Random Field (CRF) layer on such models to encourage contextual learning. We use the n2c2-2018 shared task data for model development and evaluation. The experimental outcomes show that 1) CRF layers improved all language models; 2) in BIO-strict span-level evaluation with the macro-average F1 score, the fine-tuned LLMs achieved 0.83+ while the TransformerCRF model trained from scratch achieved 0.78+, a comparable performance at much lower cost, e.g. with 39.80\% fewer training parameters; 3) in BIO-strict span-level evaluation with the weighted-average F1 score, ClinicalBERT-CRF, BERT-CRF, and TransformerCRF showed only small score differences, at 97.59\%/97.44\%/96.84\% respectively; and 4) efficient training by down-sampling for a better data distribution further reduced the training cost and the need for data while maintaining similar scores, i.e. around 0.02 points lower than with the full dataset. Our models will be hosted at \url{https://github.com/HECTA-UoM/TransformerCRF}
Comment: working paper - Large Language Models, Fine-tuning LLMs, Clinical NLP, Medication Mining, AI for Healthcare
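The abstract does not include implementation details, so the following is only a minimal sketch of the general BERT-plus-CRF tagging architecture it describes, assuming the Hugging Face transformers and pytorch-crf packages; the model name and tag count are illustrative, not taken from the paper.

# Minimal sketch (not the paper's code): a BERT encoder with a CRF layer on top
# for BIO tagging. Assumes `transformers` and `pytorch-crf`; num_tags is illustrative.
import torch.nn as nn
from transformers import AutoModel
from torchcrf import CRF

class BertCrfTagger(nn.Module):
    def __init__(self, model_name="bert-base-cased", num_tags=19):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.emission = nn.Linear(self.encoder.config.hidden_size, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        emissions = self.emission(hidden)
        mask = attention_mask.bool()
        if tags is not None:
            # Training: negative log-likelihood of the gold tag sequence under the CRF.
            return -self.crf(emissions, tags, mask=mask, reduction="mean")
        # Inference: Viterbi decoding returns the most likely BIO tag sequence per input.
        return self.crf.decode(emissions, mask=mask)

A Transformer trained from scratch, as compared in the paper, would swap the pre-trained encoder for a randomly initialised one with the same interface.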
Generating Medical Prescriptions with Conditional Transformer
Access to real-world medication prescriptions is essential for medical
research and healthcare quality improvement. However, such access is often limited due to the sensitive nature of the information expressed. Additionally, manually labelling these instructions for training and
fine-tuning Natural Language Processing (NLP) models can be tedious and
expensive. We introduce a novel task-specific model architecture,
Label-To-Text-Transformer (\textbf{LT3}), tailored to generate synthetic
medication prescriptions based on provided labels, such as a vocabulary list of
medications and their attributes. LT3 is trained on a set of around 2K lines of
medication prescriptions extracted from the MIMIC-III database, allowing the
model to produce valuable synthetic medication prescriptions. We evaluate LT3's
performance by contrasting it with a state-of-the-art Pre-trained Language
Model (PLM), T5, analysing the quality and diversity of generated texts. We
deploy the generated synthetic data to train the SpacyNER model for the Named
Entity Recognition (NER) task over the n2c2-2018 dataset. The experiments show
that the model trained on synthetic data can achieve a 96-98\% F1 score at
Label Recognition on Drug, Frequency, Route, Strength, and Form. LT3 code and data will be shared at \url{https://github.com/HECTA-UoM/Label-To-Text-Transformer}
Comment: Accepted to: Workshop on Synthetic Data Generation with Generative AI (SyntheticData4ML Workshop) at NeurIPS 2023
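LT3's architecture itself is not detailed in this abstract; as a rough point of reference, the T5 baseline it is compared against can be run for label-to-text generation along the following lines, where the label string format and drug attributes shown are invented for illustration, and the model would first be fine-tuned on label/prescription pairs extracted from MIMIC-III.

# Illustrative baseline only (not LT3): conditional generation of a prescription
# string from a label list with T5, via the Hugging Face `transformers` API.
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")  # fine-tune before use

# Hypothetical input format: the medication labels to condition the generation on.
labels = "drug: metformin | strength: 500 mg | route: oral | frequency: twice daily"
inputs = tokenizer(labels, return_tensors="pt")

# Nucleus sampling encourages diversity in the generated synthetic prescriptions.
output_ids = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.9)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))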
Large Language Models and Control Mechanisms Improve Text Readability of Biomedical Abstracts
Biomedical literature often uses complex language and inaccessible
professional terminologies. That is why simplification plays an important role
in improving public health literacy. Applying Natural Language Processing (NLP)
models to automate such tasks allows for quick and direct accessibility for lay
readers. In this work, we investigate the ability of state-of-the-art large
language models (LLMs) on the task of biomedical abstract simplification, using
the publicly available dataset for plain language adaptation of biomedical
abstracts (\textbf{PLABA}). The methods applied include domain fine-tuning and
prompt-based learning (PBL) on: 1) Encoder-decoder models (T5, SciFive, and
BART), 2) Decoder-only GPT models (GPT-3.5 and GPT-4) from OpenAI and BioGPT,
and 3) Control-token mechanisms on BART-based models. We used a range of
automatic evaluation metrics, including BLEU, ROUGE, SARI, and BERTscore, and
also conducted human evaluations. BART-Large with Control Token (BART-L-w-CT)
mechanisms reported the highest SARI score of 46.54, and T5-Base reported the highest BERTscore of 72.62. In human evaluation, BART-L-w-CTs achieved a better
simplicity score over T5-Base (2.9 vs. 2.2), while T5-Base achieved a better
meaning preservation score over BART-L-w-CTs (3.1 vs. 2.6). We also categorised
the system outputs with examples, hoping this will shed some light on future research on this task. Our code, fine-tuned models, and data splits are available at \url{https://github.com/HECTA-UoM/PLABA-MU}
Comment: working paper
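As a concrete illustration of the automatic evaluation step, SARI and BERTScore can be computed with the Hugging Face evaluate library roughly as below; the source, prediction, and reference sentences are invented examples, not data from PLABA.

# Hedged sketch of the metric computation only: scoring one simplified sentence
# with SARI and BERTScore via the `evaluate` library.
import evaluate

sources = ["Myocardial infarction is characterised by ischaemic necrosis of cardiac tissue."]
predictions = ["A heart attack happens when heart tissue dies from lack of blood flow."]
references = [["A heart attack is when part of the heart muscle dies because it gets too little blood."]]

sari = evaluate.load("sari")
print(sari.compute(sources=sources, predictions=predictions, references=references))

bertscore = evaluate.load("bertscore")
print(bertscore.compute(predictions=predictions, references=[r[0] for r in references], lang="en"))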
Opportunities and Challenges for Molecular Understanding of Ciliopathies-The 100,000 Genomes Project.
Cilia are highly specialized cellular organelles that serve multiple functions in human development and health. Their central importance in the body is demonstrated by the occurrence of a diverse range of developmental disorders that arise from defects of cilia structure and function, caused by a range of different inherited mutations found in more than 150 different genes. Genetic analysis has rapidly advanced our understanding of the cell biological basis of ciliopathies over the past two decades, with more recent technological advances in genomics rapidly accelerating this progress. The 100,000 Genomes Project was launched in 2012 in the UK to improve diagnosis and future care for individuals affected by rare diseases like ciliopathies, through whole genome sequencing (WGS). In this review, we discuss the potential promise and medical impact of WGS for ciliopathies and report on current progress of the 100,000 Genomes Project, reviewing the medical, technical, and ethical challenges and opportunities that new, large-scale initiatives such as this can offer.
Development of simple and transferable molecular models for biodiesel production with the soft-SAFT equation of state
Knowledge of the thermodynamic properties of fatty acid esters/biodiesels is crucial not only for developing optimal biodiesel production and purification processes, but also for enhancing biodiesel performance in engines. This work applies a simple but reliable, theoretically sound model, the soft-SAFT EoS, as a tool for the development, design, scale-up, and optimization of biodiesel production and purification processes. A molecular model within the soft-SAFT EoS framework is proposed for the fatty acid esters; the Density Gradient Theory approach is coupled to soft-SAFT for the description of interfacial properties, while the Free-Volume Theory is used for the calculation of viscosities, in an integrated model. For pressures up to 150 MPa and temperatures from 288.15 to 423.15 K, density, surface tension, viscosity, and speed of sound data for fatty acid methyl and ethyl esters, ranging from C-8:0 to C-24:0 and with up to three unsaturated bonds, are described with deviations below 5%. Finally, to validate the predictive ability of the model for biodiesel applications, high-pressure densities and viscosities of 8 biodiesels were predicted with the soft-SAFT EoS, reinforcing the validity of the approach for obtaining reliable predictions for engineering purposes. (C) 2014 The Institution of Chemical Engineers. Published by Elsevier B.V. All rights reserved.
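For orientation only, the soft-SAFT EoS writes the residual Helmholtz energy of a chain fluid as the sum of a Lennard-Jones reference term, a chain term, and an association term; the reduced form below is the standard soft-SAFT decomposition, not the specific parameterisation fitted in this work.

\[
\frac{A^{\mathrm{res}}}{N k_{B} T} = \tilde{a}^{\mathrm{LJ}} + \tilde{a}^{\mathrm{chain}} + \tilde{a}^{\mathrm{assoc}},
\qquad
\tilde{a}^{\mathrm{chain}} = \sum_{i} x_{i}\,(1 - m_{i}) \ln g_{\mathrm{LJ}}(\sigma_{ii}),
\]

where $x_{i}$ is the mole fraction, $m_{i}$ the chain length, and $g_{\mathrm{LJ}}(\sigma_{ii})$ the pair correlation function of the Lennard-Jones reference fluid evaluated at the segment diameter. Interfacial tensions and viscosities then follow from coupling this free energy to the Density Gradient Theory and the Free-Volume Theory, respectively, as described in the abstract.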