Open-Ended Medical Visual Question Answering Through Prefix Tuning of Language Models
Medical Visual Question Answering (VQA) is an important challenge, as it
would lead to faster and more accurate diagnoses and treatment decisions. Most
existing methods approach it as a multi-class classification problem, which
restricts the outcome to a predefined closed set of curated answers. We focus
on open-ended VQA and, motivated by recent advances in language models,
treat it as a generative task. Leveraging pre-trained language models, we
introduce a novel method particularly suited for small, domain-specific,
medical datasets. To properly communicate the medical images to the language
model, we develop a network that maps the extracted visual features to a set of
learnable tokens. Then, alongside the question, these learnable tokens directly
prompt the language model. We explore recent parameter-efficient fine-tuning
strategies for language models, which allow for resource- and data-efficient
fine-tuning. We evaluate our approach on the prime medical VQA benchmarks,
namely, Slake, OVQA and PathVQA. The results demonstrate that our approach
outperforms existing methods across various training settings while also being
computationally efficient.
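As a rough illustration of the prefix-tuning idea described above, the sketch below (PyTorch, with illustrative dimensions and a hypothetical VisualPrefixMapper module, not the authors' exact architecture) maps a pooled visual feature to a few learnable prefix embeddings that are concatenated with the embedded question and fed to a language model via its inputs_embeds interface.

```python
import torch
import torch.nn as nn

class VisualPrefixMapper(nn.Module):
    """Maps a pooled visual feature to a short sequence of prefix token embeddings."""
    def __init__(self, visual_dim=768, lm_dim=1024, prefix_len=8):
        super().__init__()
        self.prefix_len = prefix_len
        self.lm_dim = lm_dim
        # Small MLP that expands the visual feature into `prefix_len` LM-sized tokens.
        self.mapper = nn.Sequential(
            nn.Linear(visual_dim, lm_dim * prefix_len),
            nn.Tanh(),
            nn.Linear(lm_dim * prefix_len, lm_dim * prefix_len),
        )

    def forward(self, visual_feats):              # (batch, visual_dim)
        prefix = self.mapper(visual_feats)        # (batch, prefix_len * lm_dim)
        return prefix.view(-1, self.prefix_len, self.lm_dim)


def build_lm_inputs(prefix_embeds, question_embeds):
    """Concatenate the visual prefix with the embedded question tokens.

    The result would be passed to a (frozen or parameter-efficiently tuned)
    language model through `inputs_embeds`, so the image prompts the LM directly.
    """
    return torch.cat([prefix_embeds, question_embeds], dim=1)


# Shape check with dummy tensors; a real pipeline would use a vision encoder
# and the LM's own token-embedding layer.
mapper = VisualPrefixMapper()
lm_inputs = build_lm_inputs(mapper(torch.randn(2, 768)), torch.randn(2, 16, 1024))
print(lm_inputs.shape)  # torch.Size([2, 24, 1024])
```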
Developing ChatGPT for Biology and Medicine: A Complete Review of Biomedical Question Answering
ChatGPT explores a strategic blueprint of question answering (QA) in
delivering medical diagnosis, treatment recommendations, and other healthcare
support. This is achieved through the increasing incorporation of medical
domain data via natural language processing (NLP) and multimodal paradigms. By
transitioning the distribution of text, images, videos, and other modalities
from the general domain to the medical domain, these techniques have expedited
the progress of medical domain question answering (MDQA). They bridge the gap
between human natural language and sophisticated medical domain knowledge or
expert manual annotations, handling large-scale, diverse, unbalanced, or even
unlabeled data analysis scenarios in medical contexts. Central to our focus is
the utilization of language models and multimodal paradigms for medical question
answering, aiming to guide the research community in selecting appropriate
mechanisms for their specific medical research requirements. Specialized tasks
such as unimodal-related question answering, reading comprehension, reasoning,
diagnosis, relation extraction, probability modeling, and others, as well as
multimodal-related tasks like visual question answering, image captioning,
cross-modal retrieval, report summarization, and generation, are discussed in
detail. Each section delves into the intricate specifics of the respective
method under consideration. This paper highlights the structures and
advancements of medical domain explorations against general domain methods,
emphasizing their applications across different tasks and datasets. It also
outlines current challenges and opportunities for future medical domain
research, paving the way for continued innovation and application in this
rapidly evolving field.
Comment: 50 pages, 3 figures, 3 tables
Acoustic Prompt Tuning: Empowering Large Language Models with Audition Capabilities
The auditory system plays a substantial role in shaping the overall human
perceptual experience. While prevailing large language models (LLMs) and visual
language models (VLMs) have shown their promise in solving a wide variety of
vision and language understanding tasks, only a few of them can be generalised
to the audio domain without compromising their domain-specific capacity. In
this work, we introduce Acoustic Prompt Tuning (APT), a new adapter extending
LLMs and VLMs to the audio domain by soft prompting only. Specifically, APT
applies an instruction-aware audio aligner to generate soft prompts,
conditioned on both input text and sounds, as language model inputs. To
mitigate the data scarcity in the audio domain, a multi-task learning strategy
is proposed by formulating diverse audio tasks in a sequence-to-sequence
manner. Moreover, we improve the framework of audio language models by using
interleaved audio-text embeddings as the input sequence. This improved
framework imposes zero constraints on the input format and thus is capable of
tackling more understanding tasks, such as few-shot audio classification and
audio reasoning. To further evaluate the reasoning ability of audio networks,
we propose natural language audio reasoning (NLAR), a new task that analyses
across two audio clips by comparison and summarization. Experiments show that
APT-enhanced LLMs (namely APT-LLMs) achieve competitive results compared to the
expert models (i.e., the networks trained on the targeted datasets) across
various tasks. We finally demonstrate the APT's ability in extending frozen
VLMs to the audio domain without finetuning, achieving promising results in the
audio-visual question answering task. Our code and model weights are
released at https://github.com/JinhuaLiang/APT
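A minimal sketch of what an instruction-aware audio aligner could look like, assuming a single cross-attention block over learnable query tokens conditioned jointly on audio features and the embedded instruction text; the module name, dimensions, and one-layer design are assumptions for illustration, not the released APT implementation.

```python
import torch
import torch.nn as nn

class AudioAligner(nn.Module):
    """Instruction-aware aligner: turns audio features into soft prompt tokens,
    conditioned on the (already embedded) instruction text via cross-attention."""
    def __init__(self, audio_dim=512, lm_dim=1024, n_prompts=8, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_prompts, lm_dim))
        self.audio_proj = nn.Linear(audio_dim, lm_dim)
        self.cross_attn = nn.MultiheadAttention(lm_dim, n_heads, batch_first=True)

    def forward(self, audio_feats, text_embeds):
        # audio_feats: (B, T_audio, audio_dim), text_embeds: (B, T_text, lm_dim)
        batch = audio_feats.size(0)
        queries = self.queries.unsqueeze(0).expand(batch, -1, -1)
        # Condition the learnable queries on both the sound and the instruction.
        context = torch.cat([self.audio_proj(audio_feats), text_embeds], dim=1)
        soft_prompts, _ = self.cross_attn(queries, context, context)
        return soft_prompts  # (B, n_prompts, lm_dim), to interleave with text embeddings

# Dummy shape check; real inputs would come from an audio encoder and the LM's
# token-embedding layer.
aligner = AudioAligner()
prompts = aligner(torch.randn(2, 50, 512), torch.randn(2, 12, 1024))
print(prompts.shape)  # torch.Size([2, 8, 1024])
```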
CAT: enhancing multimodal large language model to answer questions in dynamic audio-visual scenarios
This paper focuses on the challenge of answering questions in scenarios
composed of rich and complex dynamic audio-visual components. Although
existing Multimodal Large Language Models (MLLMs) can respond to
audio-visual content, these responses are sometimes ambiguous and fail to
describe specific audio-visual events. To overcome this limitation, we
introduce CAT, which enhances MLLMs in three ways: 1) besides
straightforwardly bridging audio and video, we design a clue aggregator
that aggregates question-related clues in dynamic audio-visual scenarios
to enrich the detailed knowledge required by large language models; 2) CAT
is trained on a mixed multimodal dataset, allowing direct application in
audio-visual scenarios; notably, we collect an audio-visual joint
instruction dataset named AVinstruct to further enhance the capacity of
CAT to model cross-semantic correlations; 3) we propose AI-assisted
ambiguity-aware direct preference optimization, a strategy specialized in
retraining the model to favor non-ambiguous responses and to improve its
ability to localize specific audio-visual objects. Extensive experimental
results demonstrate that CAT outperforms existing methods on multimodal
tasks, especially on Audio-Visual Question Answering (AVQA) tasks. The
codes and the collected instructions are released at
https://github.com/rikeilong/Bay-CA
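The ambiguity-aware preference optimization builds on the direct preference optimization (DPO) objective; below is a sketch of the vanilla DPO loss, where the non-ambiguous response plays the role of the chosen sample and the ambiguous one the rejected sample. This pairing and the function signature are illustrative assumptions, not the paper's exact AI-assisted ambiguity-aware formulation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Vanilla DPO objective: push the policy to prefer the chosen
    (non-ambiguous) response over the rejected (ambiguous) one, measured
    relative to a frozen reference model.

    All arguments are summed log-probabilities of whole responses, shape (batch,).
    """
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy values: the policy already slightly favors the non-ambiguous answer.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.5]))
print(loss.item())
```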
Foundational Models in Medical Imaging: A Comprehensive Survey and Future Vision
Foundation models, i.e., large-scale pre-trained deep-learning models adapted
to a wide range of downstream tasks, have gained significant interest lately,
as many deep-learning problems are undergoing a paradigm shift with the rise
of these models. Trained on large-scale datasets to bridge the gap between
different modalities, foundation models facilitate contextual reasoning,
generalization, and prompting capabilities at test time. The predictions of these
models can be adjusted for new tasks by augmenting the model input with
task-specific hints called prompts without requiring extensive labeled data and
retraining. Capitalizing on the advances in computer vision, medical imaging
has also marked a growing interest in these models. To assist researchers in
navigating this direction, this survey intends to provide a comprehensive
overview of foundation models in the domain of medical imaging. Specifically,
we initiate our exploration by providing an exposition of the fundamental
concepts forming the basis of foundation models. Subsequently, we offer a
methodical taxonomy of foundation models within the medical domain, proposing a
classification system primarily structured around training strategies, while
also incorporating additional facets such as application domains, imaging
modalities, specific organs of interest, and the algorithms integral to these
models. Furthermore, we emphasize the practical use case of some selected
approaches and then discuss the opportunities, applications, and future
directions of these large-scale pre-trained models, for analyzing medical
images. In the same vein, we address the prevailing challenges and research
pathways associated with foundational models in medical imaging. These
encompass the areas of interpretability, data management, computational
requirements, and the nuanced issue of contextual comprehension.
Comment: The paper is currently in the process of being prepared for
submission to MI
Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models
Video Question Answering (VideoQA) has been significantly advanced from the
scaling of recent Large Language Models (LLMs). The key idea is to convert the
visual information into the language feature space so that the capacity of LLMs
can be fully exploited. Existing VideoQA methods typically take two paradigms:
(1) learning cross-modal alignment, and (2) using an off-the-shelf captioning
model to describe the visual data. However, the first design requires costly
training on large amounts of extra multi-modal data, whilst the second suffers
from limited domain generalization. To address these limitations, a simple yet
effective Retrieving-to-Answer (R2A) framework is proposed. Given an input
video, R2A first retrieves a set of semantically similar texts from a generic
text corpus using a pre-trained multi-modal model (e.g., CLIP). With both the
question and the retrieved texts, an LLM (e.g., DeBERTa) can be directly used
to yield the desired answer. Without the need for cross-modal fine-tuning, R2A
allows all the key components (e.g., the LLM, retrieval model, and text corpus)
to be plug-and-play. Extensive experiments on several VideoQA benchmarks show
that, despite having only 1.3B parameters and no fine-tuning, our R2A can
outperform the 61-times-larger Flamingo-80B model, even though the latter was
additionally trained on nearly 2.1B multi-modal data.
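A schematic sketch of the retrieve-then-answer flow, assuming pre-computed video and corpus embeddings (in R2A these would come from a frozen multi-modal encoder such as CLIP) and a plain text prompt handed to the LLM; the helper names and the prompt template are hypothetical.

```python
import torch
import torch.nn.functional as F

def retrieve_texts(video_embed, corpus_embeds, corpus_texts, k=5):
    """Retrieve the k corpus sentences most similar to the video embedding.

    Random tensors stand in for the frozen multi-modal encoder outputs here.
    """
    sims = F.cosine_similarity(video_embed.unsqueeze(0), corpus_embeds, dim=-1)
    top = sims.topk(k).indices
    return [corpus_texts[i] for i in top]

def build_qa_prompt(question, retrieved):
    """Form the LLM input from the question and the retrieved descriptions."""
    context = " ".join(retrieved)
    return f"Context: {context}\nQuestion: {question}\nAnswer:"

# Toy corpus and query; a real system would embed a generic text corpus once
# and reuse it for every incoming video.
corpus_texts = [f"caption {i}" for i in range(100)]
corpus_embeds = torch.randn(100, 512)
video_embed = torch.randn(512)
prompt = build_qa_prompt("What is the person doing?",
                         retrieve_texts(video_embed, corpus_embeds, corpus_texts, k=3))
print(prompt)
```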