Localising In-Domain Adaptation of Transformer-Based Biomedical Language Models
In the era of digital healthcare, the huge volumes of textual information
generated every day in hospitals constitute an essential but underused asset
that could be exploited with task-specific, fine-tuned biomedical language
representation models, improving patient care and management. For such
specialized domains, previous research has shown that fine-tuning models
stemming from broad-coverage checkpoints can benefit largely from additional
training rounds over large-scale in-domain resources. However, these resources
are often unreachable for less-resourced languages like Italian, preventing
local medical institutions from employing in-domain adaptation. In order to reduce
this gap, our work investigates two accessible approaches to derive biomedical
language models in languages other than English, taking Italian as a concrete
use-case: one based on neural machine translation of English resources,
favoring quantity over quality; the other based on a high-grade, narrow-scoped
corpus natively written in Italian, thus preferring quality over quantity. Our
study shows that data quantity is a harder constraint than data quality for
biomedical adaptation, but the concatenation of high-quality data can improve
model performance even when dealing with relatively size-limited corpora. The
models published from our investigations have the potential to unlock important
research opportunities for Italian hospitals and academia. Finally, the set of
lessons learned from the study constitutes valuable insights towards a solution
to build biomedical language models that are generalizable to other
less-resourced languages and different domain settings.
Comment: 8 pages, 2 figures, 6 tables. Published in the Journal of Biomedical Informatics.
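As a rough illustration of the in-domain adaptation this abstract refers to, the sketch below continues masked-language-model pretraining of a broad-coverage checkpoint on an in-domain corpus with Hugging Face Transformers. The checkpoint name, corpus path, and hyperparameters are illustrative assumptions and do not reproduce the authors' setup.

```python
# Hedged sketch of domain-adaptive continued pretraining (masked language
# modelling over an in-domain corpus). The checkpoint, corpus file, and
# hyperparameters below are assumptions made for illustration only.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

checkpoint = "dbmdz/bert-base-italian-xxl-cased"  # assumed broad-coverage Italian checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# In-domain corpus: one biomedical sentence per line (hypothetical file path).
corpus = load_dataset("text", data_files={"train": "biomedical_it_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

# Standard MLM objective: 15% of tokens are masked and must be reconstructed.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="bert-italian-biomedical",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=5e-5,
)

Trainer(model=model, args=args, train_dataset=tokenized["train"],
        data_collator=collator).train()
```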
Advancing Italian Biomedical Information Extraction with Large Language Models: Methodological Insights and Multicenter Practical Application
The introduction of computerized medical records in hospitals has reduced
burdensome operations like manual writing and information fetching. However,
the data contained in medical records are still far underutilized, primarily
because extracting them from unstructured textual medical records takes time
and effort. Information Extraction, a subfield of Natural Language Processing,
can help clinical practitioners overcome this limitation, using automated
text-mining pipelines. In this work, we created the first Italian
neuropsychiatric Named Entity Recognition dataset, PsyNIT, and used it to
develop a Large Language Model for this task. Moreover, we conducted several
experiments with three external independent datasets to implement an effective
multicenter model, with an overall F1-score of 84.77%, Precision of 83.16%, and Recall
of 86.44%. The lessons learned are: (i) the crucial role of a consistent
annotation process and (ii) a fine-tuning strategy that combines classical
methods with a "few-shot" approach. This allowed us to establish methodological
guidelines that pave the way for future implementations in this field and allow
Italian hospitals to tap into important research opportunities.
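To make the reported metrics concrete, the snippet below shows how entity-level Precision, Recall, and F1 are typically computed for BIO-tagged clinical sentences with seqeval. The example sentences, tag set, and predictions are invented for illustration and are unrelated to PsyNIT.

```python
# Minimal sketch of entity-level NER evaluation in the style of the metrics
# reported above. The gold and predicted BIO tag sequences are invented.
from seqeval.metrics import precision_score, recall_score, f1_score

# Gold and predicted tags for two hypothetical clinical sentences.
y_true = [["O", "B-DRUG", "O", "B-SYMPTOM", "I-SYMPTOM"],
          ["B-DRUG", "I-DRUG", "O", "O"]]
y_pred = [["O", "B-DRUG", "O", "B-SYMPTOM", "O"],
          ["B-DRUG", "I-DRUG", "O", "B-SYMPTOM"]]

# seqeval scores exact entity spans, not individual tokens.
print(f"Precision: {precision_score(y_true, y_pred):.2%}")
print(f"Recall:    {recall_score(y_true, y_pred):.2%}")
print(f"F1-score:  {f1_score(y_true, y_pred):.2%}")
```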
A synthetic dataset of liver disorder patients
The data in this article include 10,000 synthetic patients with liver disorders, characterized by 70 different variables, including clinical features and patient outcomes such as hospital admission or surgery. Patient data are generated, simulating real patient data as closely as possible, using a publicly available Bayesian network describing a causal model for liver disorders. By varying the network parameters, we also generated an additional set of 500 patients with characteristics that deviated from the initial patient population. We provide an overview of the synthetic data generation process and the associated scripts for generating the cohorts. This dataset can be useful for training and validating machine learning models, especially under the effect of dataset shift between training and testing sets.
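As a toy sketch of the generation process described above, the code below samples a cohort from a small Bayesian network and then perturbs one parameter to obtain a second, shifted cohort, using pgmpy. The two-variable network and its probabilities are invented placeholders, not the publicly available liver-disorder model the dataset is based on.

```python
# Toy illustration: sample synthetic patients from a Bayesian network, then
# perturb a parameter to create a second cohort under dataset shift.
# The network structure and probabilities below are invented placeholders.
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.sampling import BayesianModelSampling


def build_model(p_disorder=0.3):
    """Tiny causal model: LiverDisorder -> HospitalAdmission (both binary)."""
    model = BayesianNetwork([("LiverDisorder", "HospitalAdmission")])
    cpd_disorder = TabularCPD("LiverDisorder", 2,
                              [[1 - p_disorder], [p_disorder]])
    cpd_admission = TabularCPD(
        "HospitalAdmission", 2,
        [[0.9, 0.4],   # P(no admission | LiverDisorder = 0, 1)
         [0.1, 0.6]],  # P(admission    | LiverDisorder = 0, 1)
        evidence=["LiverDisorder"], evidence_card=[2])
    model.add_cpds(cpd_disorder, cpd_admission)
    model.check_model()
    return model


# Main cohort: 10,000 patients sampled with the baseline parameters.
main_cohort = BayesianModelSampling(build_model()).forward_sample(size=10_000)

# Shifted cohort: 500 patients from a perturbed prior, emulating dataset shift.
shifted_cohort = BayesianModelSampling(
    build_model(p_disorder=0.6)).forward_sample(size=500)

print(main_cohort["LiverDisorder"].value_counts(normalize=True))
print(shifted_cohort["LiverDisorder"].value_counts(normalize=True))
```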