Effect of optimization framework on rigid and non-rigid multimodal image registration
The process of transforming or aligning two images is known as image registration. Image registration is currently one of the most widely used transformation tools in, for example, satellite and medical imaging analysis. Images captured by different devices that can be processed under the same registration model are called multimodal images. In this work, we present a multimodal image registration framework to which ant colony optimization (ACO) and the flower pollination algorithm (FPA), two metaheuristic algorithms, are applied in order to improve the performance of the proposed rigid and non-rigid multimodal registration framework and decrease its processing time. The results of the ACO- and FPA-based frameworks were compared against those of particle swarm optimization and genetic algorithm-based frameworks and appear promising.
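The abstract includes no code; the sketch below illustrates the general shape of such a framework: a population-based metaheuristic searches rigid transform parameters (rotation plus translation) to maximize mutual information between the two modalities. The move-toward-best update is a generic stand-in for the ACO/FPA rules, and all bounds and parameter names are illustrative.

```python
import numpy as np
from scipy import ndimage

def mutual_information(a, b, bins=32):
    # Mutual information from the joint intensity histogram of the two images.
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    return np.sum(pxy[nz] * np.log(pxy[nz] / (px[:, None] * py[None, :])[nz]))

def apply_rigid(img, theta, tx, ty):
    # 2-D rigid transform: rotate about the image centre, then translate.
    rotated = ndimage.rotate(img, np.degrees(theta), reshape=False, order=1)
    return ndimage.shift(rotated, (ty, tx), order=1)

def register(fixed, moving, n_agents=20, n_iters=50, seed=0):
    # Population search over (theta, tx, ty): evaluate every agent, keep the
    # best, and pull the population toward it with exploration noise. ACO and
    # FPA plug their own update rules into this same evaluate-and-move loop.
    rng = np.random.default_rng(seed)
    lo, hi = np.array([-0.3, -10.0, -10.0]), np.array([0.3, 10.0, 10.0])
    pop = rng.uniform(lo, hi, size=(n_agents, 3))
    best, best_mi = pop[0].copy(), -np.inf
    for _ in range(n_iters):
        for params in pop:
            mi = mutual_information(fixed, apply_rigid(moving, *params))
            if mi > best_mi:
                best, best_mi = params.copy(), mi
        pop += 0.5 * (best - pop) + rng.normal(0.0, 0.05, pop.shape) * (hi - lo)
        pop = np.clip(pop, lo, hi)
    return best, best_mi
```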
Masked Vision and Language Pre-training with Unimodal and Multimodal Contrastive Losses for Medical Visual Question Answering
Medical visual question answering (VQA) is a challenging task that requires answering clinical questions about a given medical image by taking both visual and language information into consideration. However, due to the small scale of training data for medical VQA, pre-training and fine-tuning paradigms have become a commonly used solution to improve model generalization. In this paper, we present a novel self-supervised approach that learns unimodal and multimodal feature representations of input images and text using medical image-caption datasets, by leveraging both unimodal and multimodal contrastive losses, along with masked language modeling and image-text matching, as pre-training objectives. The pre-trained model is then transferred to downstream medical VQA tasks. The proposed approach achieves state-of-the-art (SOTA) performance on three publicly available medical VQA datasets, with significant accuracy improvements of 2.2%, 14.7%, and 1.7%, respectively. In addition, we conduct a comprehensive analysis to validate the effectiveness of the different components of the approach and study different pre-training settings. Our code and models are available at https://github.com/pengfeiliHEU/MUMC.
Comment: accepted by MICCAI202
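For concreteness, the contrastive losses mentioned above usually take the standard symmetric InfoNCE form sketched below. This is a generic formulation, not the MUMC implementation; the batch size, embedding width, and temperature are illustrative.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # Symmetric InfoNCE over a batch: matching image-text pairs are positives,
    # every other pairing in the batch serves as a negative.
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))      # i-th image matches i-th text
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Usage with stand-in embeddings from any pair of encoders:
loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```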
Multi-modal Pre-training for Medical Vision-language Understanding and Generation: An Empirical Study with A New Benchmark
With the availability of large-scale, comprehensive, and general-purpose vision-language (VL) datasets such as MSCOCO, vision-language pre-training (VLP) has become an active area of research and has proven effective for various VL tasks such as visual question answering. However, studies on VLP in the medical domain have so far been scarce. To provide a comprehensive perspective on VLP for medical VL tasks, we conduct a thorough experimental analysis to study the key factors that may affect the performance of VLP with a unified vision-language Transformer. To enable sound and quick pre-training decisions, we propose RadioGraphy Captions (RGC), a high-quality, multi-modality radiographic dataset containing 18,434 image-caption pairs collected from the open-access online database MedPix. RGC can be used as a pre-training dataset or as a new benchmark for medical report generation and medical image-text retrieval. By utilizing RGC and other available datasets for pre-training, we derive several key insights that can guide future medical VLP research and establish strong new baselines for various medical VL tasks.
Comment: Published as oral paper in CHIL 202
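An image-caption dataset such as RGC is typically consumed as image files plus a caption manifest; a minimal PyTorch-style loader might look like the following sketch, where the directory layout and column names are hypothetical rather than RGC's actual release format.

```python
import csv
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset

class ImageCaptionDataset(Dataset):
    # Pairs each radiograph with its caption via a CSV manifest. Assumes a
    # hypothetical layout (root/images/<image_id>.png plus a manifest with
    # 'image_id' and 'caption' columns); adjust to the actual release.
    def __init__(self, root, manifest="captions.csv", transform=None):
        self.root = Path(root)
        self.transform = transform
        with open(self.root / manifest, newline="", encoding="utf-8") as f:
            self.rows = list(csv.DictReader(f))

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        row = self.rows[idx]
        path = self.root / "images" / f"{row['image_id']}.png"
        image = Image.open(path).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, row["caption"]
```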
ImageCLEF 2020: Multimedia Retrieval in Lifelogging, Medical, Nature, and Security Applications
This paper presents an overview of the 2020 ImageCLEF lab that will be organized as part of the Conference and Labs of the Evaluation Forum - CLEF Labs 2020 in Thessaloniki, Greece. ImageCLEF is an ongoing evaluation initiative (run since 2003) that promotes the evaluation of technologies for annotation, indexing, and retrieval of visual data, with the aim of providing information access to large collections of images in various usage scenarios and domains. In 2020, the 18th edition of ImageCLEF will organize four main tasks: (i) a Lifelog task (videos, images, and other sources) on daily activity understanding, retrieval, and summarization; (ii) a Medical task that groups three previous tasks (caption analysis, tuberculosis prediction, and medical visual question answering) with new data and adapted tasks; (iii) a Coral task on segmenting and labeling collections of coral images for 3D modeling; and a new (iv) Web user interface task addressing the problems of detecting and recognizing hand-drawn website UIs (user interfaces) for automatic code generation. The strong participation in 2019, with over 235 research groups registering and 63 of them submitting over 359 runs, shows significant interest in this benchmarking campaign. We expect the new tasks to attract at least as many researchers in 2020.
SAM-Guided Enhanced Fine-Grained Encoding with Mixed Semantic Learning for Medical Image Captioning
With the development of multimodal and large language models, deep learning-based techniques for medical image captioning hold the potential to offer valuable diagnostic recommendations. However, current generic text and image pre-trained models do not yield satisfactory results when it comes to describing intricate details within medical images. In this paper, we present a novel medical image captioning method guided by the Segment Anything Model (SAM) to enable enhanced encoding with both general and detailed feature extraction. In addition, our approach employs a distinctive pre-training strategy with mixed semantic learning to simultaneously capture both the overall information and the finer details within medical images. We demonstrate the effectiveness of this approach, which outperforms the pre-trained BLIP2 model on various evaluation metrics for generating descriptions of medical images.
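One generic way to realize the combination of general and detailed features that the abstract describes is masked average pooling of an encoder's feature map under SAM-produced region masks, yielding one global token plus one token per region. The sketch below is that generic reading, not the paper's implementation.

```python
import torch

def mixed_semantic_tokens(feat_map, masks):
    # feat_map: (C, H, W) dense features from any image encoder.
    # masks:    (N, H, W) binary region masks, e.g. produced by SAM.
    # Returns (N + 1, C): one global average token, then one token per region
    # obtained by masked average pooling, so the captioner sees both the
    # overall image and its fine-grained parts.
    m = masks.float()
    area = m.sum(dim=(1, 2)).clamp(min=1.0)                   # (N,)
    region = torch.einsum("chw,nhw->nc", feat_map, m) / area[:, None]
    global_tok = feat_map.mean(dim=(1, 2))                    # (C,)
    return torch.cat([global_tok[None, :], region], dim=0)
```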
PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering
In this paper, we focus on the problem of Medical Visual Question Answering (MedVQA), which is crucial for efficiently interpreting medical images that carry vital, clinically relevant information. First, we reframe MedVQA as a generation task that naturally follows human-machine interaction, and we propose a generative model for medical visual understanding that aligns visual information from a pre-trained vision encoder with a large language model. Second, we establish a scalable pipeline to construct a large-scale medical visual question-answering dataset, named PMC-VQA, which contains 227k VQA pairs over 149k images covering various modalities and diseases. Third, we pre-train our proposed model on PMC-VQA and then fine-tune it on multiple public benchmarks, e.g., VQA-RAD and SLAKE, outperforming existing work by a large margin. Additionally, we propose a manually verified test set that is significantly more challenging; even the best models struggle to solve it.
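The alignment step described above, feeding visual information from a pre-trained vision encoder into a large language model, is commonly realized by projecting patch features into the LLM's embedding space and prepending them to the question tokens. Below is a minimal sketch of that idea; the module name and dimensions are illustrative, not PMC-VQA's actual configuration.

```python
import torch
import torch.nn as nn

class VisualPrefix(nn.Module):
    # Projects frozen vision-encoder patch features into the LLM embedding
    # space and prepends them to the embedded question, so the LLM generates
    # the answer conditioned on the image. Dimensions are illustrative.
    def __init__(self, vis_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, patch_feats, question_embeds):
        # patch_feats: (B, P, vis_dim); question_embeds: (B, T, llm_dim)
        prefix = self.proj(patch_feats)                     # (B, P, llm_dim)
        return torch.cat([prefix, question_embeds], dim=1)  # (B, P+T, llm_dim)
```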
Free Form Medical Visual Question Answering in Radiology
Visual Question Answering (VQA) in the medical domain presents a unique, interdisciplinary challenge, combining fields such as Computer Vision, Natural Language Processing, and Knowledge Representation. Despite its importance, research in medical VQA has been scant, only gaining momentum since 2018. Addressing this gap, our research delves into the effective representation of radiology images and the joint learning of multimodal representations, surpassing existing methods. We augment the SLAKE dataset so that our model can respond to a more diverse array of questions, not limited to the immediate content of radiology or pathology images. Our model achieves a top-1 accuracy of 79.55% with a less complex architecture, demonstrating performance comparable to current state-of-the-art models. This research not only advances medical VQA but also opens avenues for practical applications in diagnostic settings.
Comment: 6 pages and 4 figures
Developing ChatGPT for Biology and Medicine: A Complete Review of Biomedical Question Answering
ChatGPT explores a strategic blueprint of question answering (QA) for delivering medical diagnoses, treatment recommendations, and other healthcare support. This is achieved through the increasing incorporation of medical domain data via natural language processing (NLP) and multimodal paradigms. By transferring the distribution of text, images, videos, and other modalities from the general domain to the medical domain, these techniques have expedited the progress of medical domain question answering (MDQA). They bridge the gap between human natural language and sophisticated medical domain knowledge or expert manual annotations, handling large-scale, diverse, unbalanced, or even unlabeled data analysis scenarios in medical contexts. Central to our focus is the use of language models and multimodal paradigms for medical question answering, aiming to guide the research community in selecting appropriate mechanisms for their specific medical research requirements. Specialized unimodal tasks such as question answering, reading comprehension, reasoning, diagnosis, relation extraction, and probability modeling, as well as multimodal tasks such as visual question answering, image captioning, cross-modal retrieval, and report summarization and generation, are discussed in detail. Each section delves into the intricate specifics of the respective methods under consideration. This paper highlights the structures and advancements of medical domain explorations against general domain methods, emphasizing their applications across different tasks and datasets. It also outlines current challenges and opportunities for future medical domain research, paving the way for continued innovation and application in this rapidly evolving field.
Comment: 50 pages, 3 figures, 3 tables
Customizing General-Purpose Foundation Models for Medical Report Generation
Medical caption prediction, which can be regarded as a medical report generation (MRG) task, requires the automatic generation of coherent and accurate captions for given medical images. However, the scarcity of labelled medical image-report pairs presents great challenges for developing deep, large-scale neural networks capable of harnessing the potential artificial general intelligence power of large language models (LLMs). In this work, we propose customizing off-the-shelf general-purpose large-scale pre-trained models, i.e., foundation models (FMs), from computer vision and natural language processing, with a specific focus on medical report generation. Specifically, following BLIP-2, a state-of-the-art vision-language pre-training approach, we introduce our encoder-decoder-based MRG model. This model uses a lightweight query Transformer to connect two FMs: the giant vision Transformer EVA-ViT-g and a bilingual LLM trained to align with human intentions (ChatGLM-6B). Furthermore, we conduct ablation experiments on the trainable components of the model to identify the crucial factors for effective transfer learning. Our findings demonstrate that unfreezing EVA-ViT-g to learn medical image representations, followed by parameter-efficient training of ChatGLM-6B to capture the writing styles of medical reports, is essential for achieving optimal results. Our best attempt (PCLmed Team) ranked 4th and 2nd out of 13 participating teams on the BERTScore and ROUGE-1 metrics, respectively, in the ImageCLEFmedical Caption 2023 Caption Prediction Task competition.
Comment: 14 pages, 3 figures
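The reported recipe, unfreezing the vision encoder while training the language model parameter-efficiently, can be expressed as a freezing policy over the three components. In the sketch below, training only the LLM's bias terms is a hypothetical stand-in for whichever parameter-efficient scheme (e.g., LoRA-style adapters) is actually used.

```python
import torch.nn as nn

def configure_trainable(vision_encoder: nn.Module,
                        q_former: nn.Module,
                        llm: nn.Module) -> None:
    # Unfreeze the vision encoder so it can learn medical image features,
    # always train the lightweight query Transformer, and update the LLM only
    # through a small parameter subset. Training only bias terms here is a
    # hypothetical stand-in for the actual parameter-efficient scheme.
    for p in vision_encoder.parameters():
        p.requires_grad = True
    for p in q_former.parameters():
        p.requires_grad = True
    for name, p in llm.named_parameters():
        p.requires_grad = name.endswith("bias")
```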