Effect of optimization framework on rigid and non-rigid multimodal image registration
The process of transforming or aligning two images is known as image registration. Image registration is currently one of the most widely used transformation tools in, for example, satellite and medical imaging analysis. Images captured by different devices that can be processed under the same registration model are called multimodal images. In this work, we present a multimodal image registration framework to which ant colony optimization (ACO) and the flower pollination algorithm (FPA), two metaheuristic algorithms, are applied in order to improve the performance of the proposed rigid and non-rigid multimodal registration framework and decrease its processing time. The results of the ACO- and FPA-based frameworks were compared against those of particle swarm optimization and genetic algorithm-based frameworks and appear promising.
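The abstract includes no code; the sketch below illustrates the general shape of such a framework: a population-based metaheuristic searches rigid transform parameters (rotation plus translation) to maximize mutual information between the two modalities. The move-toward-best update is a generic stand-in for the ACO/FPA rules, and all bounds and parameter names are illustrative.

```python
import numpy as np
from scipy import ndimage

def mutual_information(a, b, bins=32):
    # Mutual information from the joint intensity histogram of the two images.
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    return np.sum(pxy[nz] * np.log(pxy[nz] / (px[:, None] * py[None, :])[nz]))

def apply_rigid(img, theta, tx, ty):
    # 2-D rigid transform: rotate about the image centre, then translate.
    rotated = ndimage.rotate(img, np.degrees(theta), reshape=False, order=1)
    return ndimage.shift(rotated, (ty, tx), order=1)

def register(fixed, moving, n_agents=20, n_iters=50, seed=0):
    # Population search over (theta, tx, ty): evaluate every agent, keep the
    # best, and pull the population toward it with exploration noise. ACO and
    # FPA plug their own update rules into this same evaluate-and-move loop.
    rng = np.random.default_rng(seed)
    lo, hi = np.array([-0.3, -10.0, -10.0]), np.array([0.3, 10.0, 10.0])
    pop = rng.uniform(lo, hi, size=(n_agents, 3))
    best, best_mi = pop[0].copy(), -np.inf
    for _ in range(n_iters):
        for params in pop:
            mi = mutual_information(fixed, apply_rigid(moving, *params))
            if mi > best_mi:
                best, best_mi = params.copy(), mi
        pop += 0.5 * (best - pop) + rng.normal(0.0, 0.05, pop.shape) * (hi - lo)
        pop = np.clip(pop, lo, hi)
    return best, best_mi
```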
Masked Vision and Language Pre-training with Unimodal and Multimodal Contrastive Losses for Medical Visual Question Answering
Medical visual question answering (VQA) is a challenging task that requires answering clinical questions about a given medical image by taking both visual and language information into consideration. However, due to the small scale of training data for medical VQA, pre-training and fine-tuning paradigms have become a commonly used solution to improve model generalization. In this paper, we present a novel self-supervised approach that learns unimodal and multimodal feature representations of input images and text using medical image-caption datasets, by leveraging both unimodal and multimodal contrastive losses, along with masked language modeling and image-text matching, as pre-training objectives. The pre-trained model is then transferred to downstream medical VQA tasks. The proposed approach achieves state-of-the-art (SOTA) performance on three publicly available medical VQA datasets, with significant accuracy improvements of 2.2%, 14.7%, and 1.7%, respectively. In addition, we conduct a comprehensive analysis to validate the effectiveness of the different components of the approach and study different pre-training settings. Our code and models are available at https://github.com/pengfeiliHEU/MUMC.
Comment: accepted by MICCAI202
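For concreteness, the contrastive losses mentioned above usually take the standard symmetric InfoNCE form sketched below. This is a generic formulation, not the MUMC implementation; the batch size, embedding width, and temperature are illustrative.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # Symmetric InfoNCE over a batch: matching image-text pairs are positives,
    # every other pairing in the batch serves as a negative.
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))      # i-th image matches i-th text
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Usage with stand-in embeddings from any pair of encoders:
loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```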
Multi-modal Pre-training for Medical Vision-language Understanding and Generation: An Empirical Study with A New Benchmark
With the availability of large-scale, comprehensive, and general-purpose vision-language (VL) datasets such as MSCOCO, vision-language pre-training (VLP) has become an active area of research and has proven effective for various VL tasks such as visual question answering. However, studies on VLP in the medical domain have so far been scarce. To provide a comprehensive perspective on VLP for medical VL tasks, we conduct a thorough experimental analysis to study the key factors that may affect the performance of VLP with a unified vision-language Transformer. To enable sound and quick pre-training decisions, we propose RadioGraphy Captions (RGC), a high-quality, multi-modality radiographic dataset containing 18,434 image-caption pairs collected from the open-access online database MedPix. RGC can be used as a pre-training dataset or as a new benchmark for medical report generation and medical image-text retrieval. By utilizing RGC and other available datasets for pre-training, we derive several key insights that can guide future medical VLP research and establish strong new baselines for various medical VL tasks.
Comment: Published as oral paper in CHIL 202
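An image-caption dataset such as RGC is typically consumed as image files plus a caption manifest; a minimal PyTorch-style loader might look like the following sketch, where the directory layout and column names are hypothetical rather than RGC's actual release format.

```python
import csv
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset

class ImageCaptionDataset(Dataset):
    # Pairs each radiograph with its caption via a CSV manifest. Assumes a
    # hypothetical layout (root/images/<image_id>.png plus a manifest with
    # 'image_id' and 'caption' columns); adjust to the actual release.
    def __init__(self, root, manifest="captions.csv", transform=None):
        self.root = Path(root)
        self.transform = transform
        with open(self.root / manifest, newline="", encoding="utf-8") as f:
            self.rows = list(csv.DictReader(f))

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        row = self.rows[idx]
        path = self.root / "images" / f"{row['image_id']}.png"
        image = Image.open(path).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, row["caption"]
```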
ImageCLEF 2020: Multimedia Retrieval in Lifelogging, Medical, Nature, and Security Applications
This paper presents an overview of the 2020 ImageCLEF lab that will be organized as part of the Conference and Labs of the Evaluation Forum - CLEF Labs 2020 in Thessaloniki, Greece. ImageCLEF is an ongoing evaluation initiative (run since 2003) that promotes the evaluation of technologies for annotation, indexing, and retrieval of visual data, with the aim of providing information access to large collections of images in various usage scenarios and domains. In 2020, the 18th edition of ImageCLEF will organize four main tasks: (i) a Lifelog task (videos, images, and other sources) on daily activity understanding, retrieval, and summarization; (ii) a Medical task that groups three previous tasks (caption analysis, tuberculosis prediction, and medical visual question answering) with new data and adapted tasks; (iii) a Coral task on segmenting and labeling collections of coral images for 3D modeling; and a new (iv) Web user interface task addressing the problems of detecting and recognizing hand-drawn website UIs (user interfaces) for automatic code generation. The strong participation in 2019, with over 235 research groups registering and 63 of them submitting over 359 runs, shows significant interest in this benchmarking campaign. We expect the new tasks to attract at least as many researchers in 2020.
SAM-Guided Enhanced Fine-Grained Encoding with Mixed Semantic Learning for Medical Image Captioning
With the development of multimodal and large language models, deep learning-based techniques for medical image captioning hold the potential to offer valuable diagnostic recommendations. However, current generic text and image pre-trained models do not yield satisfactory results when it comes to describing intricate details within medical images. In this paper, we present a novel medical image captioning method guided by the Segment Anything Model (SAM) to enable enhanced encoding with both general and detailed feature extraction. In addition, our approach employs a distinctive pre-training strategy with mixed semantic learning to simultaneously capture both the overall information and the finer details within medical images. We demonstrate the effectiveness of this approach, which outperforms the pre-trained BLIP2 model on various evaluation metrics for generating descriptions of medical images.
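One generic way to realize the combination of general and detailed features that the abstract describes is masked average pooling of an encoder's feature map under SAM-produced region masks, yielding one global token plus one token per region. The sketch below is that generic reading, not the paper's implementation.

```python
import torch

def mixed_semantic_tokens(feat_map, masks):
    # feat_map: (C, H, W) dense features from any image encoder.
    # masks:    (N, H, W) binary region masks, e.g. produced by SAM.
    # Returns (N + 1, C): one global average token, then one token per region
    # obtained by masked average pooling, so the captioner sees both the
    # overall image and its fine-grained parts.
    m = masks.float()
    area = m.sum(dim=(1, 2)).clamp(min=1.0)                   # (N,)
    region = torch.einsum("chw,nhw->nc", feat_map, m) / area[:, None]
    global_tok = feat_map.mean(dim=(1, 2))                    # (C,)
    return torch.cat([global_tok[None, :], region], dim=0)
```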
PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering
In this paper, we focus on the problem of Medical Visual Question Answering (MedVQA), which is crucial for efficiently interpreting medical images that carry vital, clinically relevant information. First, we reframe MedVQA as a generation task that naturally follows human-machine interaction, and we propose a generative model for medical visual understanding that aligns visual information from a pre-trained vision encoder with a large language model. Second, we establish a scalable pipeline to construct a large-scale medical visual question-answering dataset, named PMC-VQA, which contains 227k VQA pairs over 149k images covering various modalities and diseases. Third, we pre-train our proposed model on PMC-VQA and then fine-tune it on multiple public benchmarks, e.g., VQA-RAD and SLAKE, outperforming existing work by a large margin. Additionally, we propose a manually verified test set that is significantly more challenging; even the best models struggle to solve it.
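The alignment step described above, feeding visual information from a pre-trained vision encoder into a large language model, is commonly realized by projecting patch features into the LLM's embedding space and prepending them to the question tokens. Below is a minimal sketch of that idea; the module name and dimensions are illustrative, not PMC-VQA's actual configuration.

```python
import torch
import torch.nn as nn

class VisualPrefix(nn.Module):
    # Projects frozen vision-encoder patch features into the LLM embedding
    # space and prepends them to the embedded question, so the LLM generates
    # the answer conditioned on the image. Dimensions are illustrative.
    def __init__(self, vis_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, patch_feats, question_embeds):
        # patch_feats: (B, P, vis_dim); question_embeds: (B, T, llm_dim)
        prefix = self.proj(patch_feats)                     # (B, P, llm_dim)
        return torch.cat([prefix, question_embeds], dim=1)  # (B, P+T, llm_dim)
```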
Free Form Medical Visual Question Answering in Radiology
Visual Question Answering (VQA) in the medical domain presents a unique, interdisciplinary challenge, combining fields such as Computer Vision, Natural Language Processing, and Knowledge Representation. Despite its importance, research in medical VQA has been scant, only gaining momentum since 2018. Addressing this gap, our research delves into the effective representation of radiology images and the joint learning of multimodal representations, surpassing existing methods. We augment the SLAKE dataset so that our model can respond to a more diverse array of questions, not limited to the immediate content of radiology or pathology images. Our model achieves a top-1 accuracy of 79.55% with a less complex architecture, demonstrating performance comparable to current state-of-the-art models. This research not only advances medical VQA but also opens avenues for practical applications in diagnostic settings.
Comment: 6 pages and 4 figures
Developing ChatGPT for Biology and Medicine: A Complete Review of Biomedical Question Answering
ChatGPT explores a strategic blueprint of question answering (QA) for delivering medical diagnoses, treatment recommendations, and other healthcare support. This is achieved through the increasing incorporation of medical domain data via natural language processing (NLP) and multimodal paradigms. By transferring the distribution of text, images, videos, and other modalities from the general domain to the medical domain, these techniques have expedited the progress of medical domain question answering (MDQA). They bridge the gap between human natural language and sophisticated medical domain knowledge or expert manual annotations, handling large-scale, diverse, unbalanced, or even unlabeled data analysis scenarios in medical contexts. Central to our focus is the use of language models and multimodal paradigms for medical question answering, aiming to guide the research community in selecting appropriate mechanisms for their specific medical research requirements. Specialized unimodal tasks such as question answering, reading comprehension, reasoning, diagnosis, relation extraction, and probability modeling, as well as multimodal tasks such as visual question answering, image captioning, cross-modal retrieval, and report summarization and generation, are discussed in detail. Each section delves into the intricate specifics of the respective methods under consideration. This paper highlights the structures and advancements of medical domain explorations against general domain methods, emphasizing their applications across different tasks and datasets. It also outlines current challenges and opportunities for future medical domain research, paving the way for continued innovation and application in this rapidly evolving field.
Comment: 50 pages, 3 figures, 3 tables
Customizing General-Purpose Foundation Models for Medical Report Generation
Medical caption prediction, which can be regarded as a medical report generation (MRG) task, requires the automatic generation of coherent and accurate captions for given medical images. However, the scarcity of labelled medical image-report pairs presents great challenges for developing deep, large-scale neural networks capable of harnessing the potential artificial general intelligence power of large language models (LLMs). In this work, we propose customizing off-the-shelf general-purpose large-scale pre-trained models, i.e., foundation models (FMs), from computer vision and natural language processing, with a specific focus on medical report generation. Specifically, following BLIP-2, a state-of-the-art vision-language pre-training approach, we introduce our encoder-decoder-based MRG model. This model uses a lightweight query Transformer to connect two FMs: the giant vision Transformer EVA-ViT-g and a bilingual LLM trained to align with human intentions (ChatGLM-6B). Furthermore, we conduct ablation experiments on the trainable components of the model to identify the crucial factors for effective transfer learning. Our findings demonstrate that unfreezing EVA-ViT-g to learn medical image representations, followed by parameter-efficient training of ChatGLM-6B to capture the writing styles of medical reports, is essential for achieving optimal results. Our best attempt (PCLmed Team) ranked 4th and 2nd out of 13 participating teams on the BERTScore and ROUGE-1 metrics, respectively, in the ImageCLEFmedical Caption 2023 Caption Prediction Task competition.
Comment: 14 pages, 3 figures
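The reported recipe, unfreezing the vision encoder while training the language model parameter-efficiently, can be expressed as a freezing policy over the three components. In the sketch below, training only the LLM's bias terms is a hypothetical stand-in for whichever parameter-efficient scheme (e.g., LoRA-style adapters) is actually used.

```python
import torch.nn as nn

def configure_trainable(vision_encoder: nn.Module,
                        q_former: nn.Module,
                        llm: nn.Module) -> None:
    # Unfreeze the vision encoder so it can learn medical image features,
    # always train the lightweight query Transformer, and update the LLM only
    # through a small parameter subset. Training only bias terms here is a
    # hypothetical stand-in for the actual parameter-efficient scheme.
    for p in vision_encoder.parameters():
        p.requires_grad = True
    for p in q_former.parameters():
        p.requires_grad = True
    for name, p in llm.named_parameters():
        p.requires_grad = name.endswith("bias")
```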