SurgicalGPT: End-to-End Language-Vision GPT for Visual Question Answering in Surgery
Advances in GPT-based large language models (LLMs) are revolutionizing
natural language processing and exponentially expanding their use across various
domains. Incorporating unidirectional attention, these autoregressive LLMs can
generate long and coherent paragraphs. However, for visual question answering
(VQA) tasks that require both vision and language processing, models with
bi-directional attention or models employing fusion techniques are often
employed to capture the context of multiple modalities all at once. As GPT does
not natively process vision tokens, to exploit the advancements in GPT models
for VQA in robotic surgery, we design an end-to-end trainable Language-Vision
GPT (LV-GPT) model that expands the GPT2 model to include vision input (image).
The proposed LV-GPT incorporates a feature extractor (vision tokenizer) and
vision token embedding (token type and pose). Given the limitations of
unidirectional attention in GPT models and their ability to generate coherent
long paragraphs, we carefully sequence the word tokens before vision tokens,
mimicking the human thought process of understanding the question to infer an
answer from an image. Quantitatively, we demonstrate that the LV-GPT model
outperforms other state-of-the-art VQA models on two publicly available
surgical-VQA datasets (based on endoscopic vision challenge robotic scene
segmentation 2018 and CholecTriplet2021) and on our newly annotated dataset
(based on the holistic surgical scene dataset). We further annotate all three
datasets to include question-type annotations to allow sub-type analysis.
Furthermore, we extensively study and present the effects of token sequencing,
token type and pose embedding for vision tokens in the LV-GPT model.Comment: The manuscript is accepted in MICCAI 2023. Code are available at:
https://github.com/lalithjets/SurgicalGP
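To make the described design concrete, below is a minimal, hypothetical PyTorch sketch of the word-then-vision token sequencing with token-type and pose (position) embeddings on top of Hugging Face GPT-2. The module and parameter names (LanguageVisionGPTSketch, vision_dim, num_vision_tokens) are illustrative assumptions, not the authors' released code; see the repository linked above for the actual implementation.

```python
# Hypothetical sketch of word-then-vision token sequencing for a GPT-2-based
# VQA classifier. Names and shapes are assumptions, not the authors' code.
import torch
import torch.nn as nn
from transformers import GPT2Model

class LanguageVisionGPTSketch(nn.Module):
    def __init__(self, num_answers: int, vision_dim: int = 2048, num_vision_tokens: int = 49):
        super().__init__()
        self.gpt2 = GPT2Model.from_pretrained("gpt2")
        hidden = self.gpt2.config.n_embd
        # Project vision features (from any backbone) into the GPT-2 embedding space.
        self.vision_proj = nn.Linear(vision_dim, hidden)
        # Token-type embeddings distinguish word tokens (0) from vision tokens (1).
        self.token_type_emb = nn.Embedding(2, hidden)
        # Learned positional ("pose") embeddings for the vision tokens.
        self.vision_pos_emb = nn.Embedding(num_vision_tokens, hidden)
        self.classifier = nn.Linear(hidden, num_answers)

    def forward(self, input_ids, attention_mask, vision_feats):
        # input_ids: (B, Lw) question tokens; vision_feats: (B, Lv, vision_dim).
        word_emb = self.gpt2.wte(input_ids)
        word_emb = word_emb + self.token_type_emb(torch.zeros_like(input_ids))

        vis_emb = self.vision_proj(vision_feats)
        pos_ids = torch.arange(vis_emb.size(1), device=vis_emb.device)
        vis_type = torch.ones(vis_emb.shape[:2], dtype=torch.long, device=vis_emb.device)
        vis_emb = vis_emb + self.vision_pos_emb(pos_ids) + self.token_type_emb(vis_type)

        # Key ordering choice from the abstract: question (word) tokens come first,
        # so the unidirectional attention reads the question before the image.
        inputs_embeds = torch.cat([word_emb, vis_emb], dim=1)
        vis_mask = torch.ones(vis_emb.shape[:2], dtype=attention_mask.dtype,
                              device=attention_mask.device)
        mask = torch.cat([attention_mask, vis_mask], dim=1)

        out = self.gpt2(inputs_embeds=inputs_embeds, attention_mask=mask).last_hidden_state
        return self.classifier(out[:, -1])  # predict the answer from the final token state
```

Because GPT-2 attends strictly left to right, placing the question tokens first lets every vision token attend to the full question, which is the intuition the abstract gives for this ordering.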
Improving Visual Question Answering by Referring to Generated Paragraph Captions
Paragraph-style image captions describe diverse aspects of an image as
opposed to the more common single-sentence captions that only provide an
abstract description of the image. These paragraph captions can hence contain
substantial information of the image for tasks such as visual question
answering. Moreover, this textual information is complementary with visual
information present in the image because it can discuss both more abstract
concepts and more explicit, intermediate symbolic information about objects,
events, and scenes that can directly be matched with the textual question and
copied into the textual answer (i.e., via easier modality match). Hence, we
propose a combined Visual and Textual Question Answering (VTQA) model which
takes as input a paragraph caption as well as the corresponding image, and
answers the given question based on both inputs. In our model, the inputs are
fused to extract related information by cross-attention (early fusion), then
fused again in the form of consensus (late fusion), and finally expected
answers are given an extra score to enhance the chance of selection (later
fusion). Empirical results show that paragraph captions, even when
automatically generated (via an RL-based encoder-decoder model), help correctly
answer more visual questions. Overall, our joint model, when trained on the
Visual Genome dataset, significantly improves the VQA performance over a strong
baseline model.
Comment: ACL 2019 (7 pages)
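As a rough illustration of the three fusion stages described above (early cross-attention fusion, late consensus fusion, and the extra answer-score "later fusion"), here is a hedged PyTorch sketch; all module names, dimensions, and the fixed boost weight are assumptions for exposition, not the authors' implementation.

```python
# Illustrative three-stage fusion for visual + textual (paragraph-caption) QA.
import torch
import torch.nn as nn

class VTQAFusionSketch(nn.Module):
    def __init__(self, dim: int = 512, num_answers: int = 3000):
        super().__init__()
        # Early fusion: the question cross-attends to image regions and to
        # paragraph-caption tokens separately.
        self.img_xattn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.txt_xattn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.img_head = nn.Linear(dim, num_answers)
        self.txt_head = nn.Linear(dim, num_answers)

    def forward(self, q_emb, img_feats, cap_feats, answer_prior=None):
        # q_emb: (B, Lq, D) question; img_feats: (B, R, D) image regions;
        # cap_feats: (B, Lc, D) paragraph-caption tokens.
        q_img, _ = self.img_xattn(q_emb, img_feats, img_feats)   # early fusion (visual)
        q_txt, _ = self.txt_xattn(q_emb, cap_feats, cap_feats)   # early fusion (textual)

        vis_logits = self.img_head(q_img.mean(dim=1))
        txt_logits = self.txt_head(q_txt.mean(dim=1))

        # Late fusion: consensus between the visual and textual answer scores.
        logits = vis_logits + txt_logits

        # "Later" fusion: boost candidate answers that also appear in the caption
        # (answer_prior is a 0/1 indicator over the answer vocabulary; the weight
        # 2.0 is an arbitrary illustrative value).
        if answer_prior is not None:
            logits = logits + 2.0 * answer_prior
        return logits
```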
Novel approach to integrate various feature extraction techniques for the Visual Question Answering System with skeletal images in the healthcare sector
In the realm of medical science, one of the most challenging concepts to grasp is the Medical Imaging Query Response System. Comprehending and classifying the diverse representations of the human body requires a significant degree of effort and expertise, and it is imperative for users within the healthcare sector to rigorously validate such a system. A plethora of imaging techniques, including MRI, CT, ultrasound, X-ray, PET-CT, and others, play a pivotal role in identifying medical issues and are instrumental in supporting both patient engagement and clinical decision-making. However, the models, techniques, and datasets used to process textual and visual information introduce complexities that can at times impede the provision of pertinent clinical solutions. The overarching objective of the proposed approach is to conduct a comprehensive comparative analysis of feature extraction methodologies for both visual and textual information within the Visual Question Answering (VQA) system, focusing on human skeletal images. This endeavor aims to enhance the VQA system's performance on newer datasets, address limitations inherent in existing models, and enable researchers to identify and optimize novel methods that improve the system's accuracy. The models under scrutiny encompass various feature extraction methods that help improve model quality for the healthcare industry, and the analysis helps researchers select an appropriate methodology for different datasets. To gauge the efficacy of each model in delivering the desired outcomes, an array of metrics is employed, including classification measurement accuracy, F-classification, C-true positive rate (CTPR), C-precision, C-recall, C-sensitivity, and C-false negative rate (FNR). These metrics are used to assess accuracy on any dataset and to optimize the performance of both the visual and textual components, ensuring accurate responses to the posed queries.
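Since the comparison hinges on per-class metrics such as accuracy, precision, recall/TPR (sensitivity), F-score, and false-negative rate, the following scikit-learn sketch shows one plausible way to compute them for a multi-class VQA answer classifier. The function name vqa_classification_metrics and the example pipeline names in the usage comment are hypothetical, not from the paper.

```python
# Sketch of per-class evaluation metrics for a multi-class VQA answer classifier.
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

def vqa_classification_metrics(y_true, y_pred, labels):
    """Overall accuracy plus per-class precision, recall (TPR/sensitivity),
    F-score, and false-negative rate."""
    acc = accuracy_score(y_true, y_pred)
    precision, recall, fscore, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=labels, zero_division=0)
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    tp = np.diag(cm)
    fn = cm.sum(axis=1) - tp
    # FNR = FN / (FN + TP) = 1 - recall, guarding against empty classes.
    fnr = np.divide(fn, tp + fn, out=np.zeros_like(fn, dtype=float),
                    where=(tp + fn) > 0)
    return {"accuracy": acc, "precision": precision, "recall_tpr": recall,
            "f_score": fscore, "false_negative_rate": fnr}

# Usage (hypothetical pipelines): compare two feature-extraction setups on the
# same test questions, e.g.
#   metrics_a = vqa_classification_metrics(y_test, preds_resnet_bert, labels=classes)
#   metrics_b = vqa_classification_metrics(y_test, preds_vit_biobert, labels=classes)
```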