How to Pre-Train Your Model? Comparison of Different Pre-Training Models for Biomedical Question Answering
Using deep learning models on small-scale datasets can result in
overfitting. To overcome this problem, pre-training a model and then
fine-tuning it on the small-scale dataset has been used extensively in domains
such as image processing. Similarly, for question answering, pre-training and
fine-tuning can be done in several ways. Reading comprehension models are
commonly used for pre-training, but we show that other types of pre-training can
work better. We compare two pre-training models, one based on reading comprehension
and one based on open-domain question answering, and measure their performance when
fine-tuned and tested on the BioASQ question answering dataset. We find the
open-domain question answering model to be a better fit for this task than the
reading comprehension model.
A Comprehensive Evaluation of Large Language Models on Benchmark Biomedical Text Processing Tasks
Recently, Large Language Models (LLMs) have demonstrated impressive capability
to solve a wide range of tasks. However, despite their success across various
tasks, no prior work has investigated their capability in the biomedical domain
yet. To this end, this paper aims to evaluate the performance of LLMs on
benchmark biomedical tasks. For this purpose, we conduct a comprehensive
evaluation of 4 popular LLMs in 6 diverse biomedical tasks across 26 datasets.
To the best of our knowledge, this is the first work that conducts an extensive
evaluation and comparison of various LLMs in the biomedical domain.
Interestingly, we find based on our evaluation that in biomedical datasets that
have smaller training sets, zero-shot LLMs even outperform the current
state-of-the-art fine-tuned biomedical models. This suggests that pretraining
on large text corpora makes LLMs quite specialized even in the biomedical
domain. We also find that no single LLM outperforms the others in all
tasks; the performance of different LLMs varies depending on the task.
While their performance is still quite poor in comparison to the biomedical
models that were fine-tuned on large training sets, our findings demonstrate
that LLMs have the potential to be a valuable tool for various biomedical tasks
that lack large annotated data.
Comment: Extended version of the following BioNLP paper:
https://aclanthology.org/2023.bionlp-1.30/ (arXiv:2306.04504). arXiv admin
note: substantial text overlap with arXiv:2306.0450
Finding Answers from the Word of God: Domain Adaptation for Neural Networks in Biblical Question Answering
Question answering (QA) has significantly benefitted from deep learning
techniques in recent years. However, domain-specific QA remains a challenge due
to the significant amount of data required to train a neural network. This
paper studies the answer sentence selection task in the Bible domain and answers
questions by selecting relevant verses from the Bible. For this purpose, we
create a new dataset, BibleQA, based on Bible trivia questions and propose three
neural network models for our task. We pre-train our models on a large-scale QA
dataset, SQuAD, and investigate the effect of transferring weights on model
accuracy. Furthermore, we also measure the model accuracies with different
answer context lengths and different Bible translations. We confirm that
transfer learning noticeably improves model accuracy. We
achieve relatively good results with shorter context lengths, whereas longer
context lengths decrease model accuracy. We also find that using a more modern
Bible translation in the dataset has a positive effect on the task.
Comment: The paper has been accepted at IJCNN 201
Evaluation of ChatGPT on Biomedical Tasks: A Zero-Shot Comparison with Fine-Tuned Generative Transformers
ChatGPT is a large language model developed by OpenAI. Despite its impressive
performance across various tasks, no prior work has investigated its capability
in the biomedical domain yet. To this end, this paper aims to evaluate the
performance of ChatGPT on various benchmark biomedical tasks, such as relation
extraction, document classification, question answering, and summarization. To
the best of our knowledge, this is the first work that conducts an extensive
evaluation of ChatGPT in the biomedical domain. Interestingly, we find based on
our evaluation that in biomedical datasets that have smaller training sets,
zero-shot ChatGPT even outperforms the state-of-the-art fine-tuned generative
transformer models, such as BioGPT and BioBART. This suggests that ChatGPT's
pre-training on large text corpora makes it quite specialized even in the
biomedical domain. Our findings demonstrate that ChatGPT has the potential to
be a valuable tool for various tasks in the biomedical domain that lack large
annotated data.
Comment: Accepted by BioNLP@ACL 202
AutoDiscern: Rating the Quality of Online Health Information with Hierarchical Encoder Attention-based Neural Networks
Patients increasingly turn to search engines and online content before, or in
place of, talking with a health professional. Low-quality health information,
which is common on the internet, presents risks to the patient in the form of
misinformation and a possibly poorer relationship with their physician. To
address this, the DISCERN criteria (developed at the University of Oxford) are used
to evaluate the quality of online health information. However, patients are
unlikely to take the time to apply these criteria to the health websites they
visit. We built an automated implementation of the DISCERN instrument (Brief
version) using machine learning models. We compared the performance of a
traditional model (Random Forest) with that of a hierarchical encoder
attention-based neural network (HEA) model using two language embeddings, BERT
and BioBERT. The HEA BERT and BioBERT models achieved average F1-macro scores
across all criteria of 0.75 and 0.74, respectively, outperforming the Random
Forest model (average F1-macro = 0.69). Overall, the neural network based
models achieved 81% and 86% average accuracy at 100% and 80% coverage,
respectively, compared to 94% manual rating accuracy. The attention mechanism
implemented in the HEA architectures not only provided 'model explainability'
by identifying reasonable supporting sentences for the documents fulfilling the
Brief DISCERN criteria, but also boosted F1 performance by 0.05 compared to the
same architecture without an attention mechanism. Our research suggests that it
is feasible to automate online health information quality assessment, which is
an important step towards empowering patients to become informed partners in
the healthcare process.
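The AutoDiscern abstract above reports accuracy at 100% and 80% coverage without defining the metric. A minimal sketch of one common reading follows, assuming that "coverage" means the fraction of examples kept after ranking predictions by model confidence, with the least confident ones set aside. The function name `accuracy_at_coverage` and the toy data are hypothetical and are not taken from the paper.

```python
def accuracy_at_coverage(confidences, predictions, labels, coverage):
    """Accuracy over the most confident `coverage` fraction of examples.

    Assumption: coverage is the share of examples the model is allowed
    to answer, chosen by descending confidence.
    """
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    n_keep = max(1, round(coverage * len(labels)))
    # Rank example indices from most to least confident.
    ranked = sorted(range(len(labels)),
                    key=lambda i: confidences[i], reverse=True)
    kept = ranked[:n_keep]
    correct = sum(predictions[i] == labels[i] for i in kept)
    return correct / n_keep


# Toy data, made up for illustration only.
confidences = [0.95, 0.90, 0.85, 0.60, 0.55]
predictions = [1, 0, 1, 1, 0]
labels = [1, 0, 1, 0, 1]

full = accuracy_at_coverage(confidences, predictions, labels, 1.0)
partial = accuracy_at_coverage(confidences, predictions, labels, 0.8)
print(full, partial)
```

On this toy data, dropping the least confident example raises accuracy, which is the general pattern the abstract describes when moving from 100% to 80% coverage.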
Exploring the Trade-Offs: Unified Large Language Models vs Local Fine-Tuned Models for Highly-Specific Radiology NLI Task
Recently, ChatGPT and GPT-4 have emerged and gained immense global attention
due to their unparalleled performance in language processing. Despite
demonstrating impressive capability in various open-domain tasks, their
adequacy in highly specific fields like radiology remains untested. Radiology
presents unique linguistic phenomena distinct from open-domain data due to its
specificity and complexity. Assessing the performance of large language models
(LLMs) in such specific domains is crucial not only for a thorough evaluation
of their overall performance but also for providing valuable insights into
future model design directions: whether model design should be generic or
domain-specific. To this end, in this study, we evaluate the performance of
ChatGPT/GPT-4 on a radiology NLI task and compare it to other models fine-tuned
specifically on task-related data samples. We also conduct a comprehensive
investigation on ChatGPT/GPT-4's reasoning ability by introducing varying
levels of inference difficulty. Our results show that 1) GPT-4 outperforms
ChatGPT in the radiology NLI task; 2) other specifically fine-tuned models
require significant amounts of data samples to achieve comparable performance
to ChatGPT/GPT-4. These findings demonstrate that constructing a generic model
that is capable of solving various tasks across different domains is feasible
- …