
    SurgicalGPT: End-to-End Language-Vision GPT for Visual Question Answering in Surgery

    Advances in GPT-based large language models (LLMs) are revolutionizing natural language processing, exponentially increasing its use across various domains. Incorporating uni-directional attention, these autoregressive LLMs can generate long and coherent paragraphs. However, for visual question answering (VQA) tasks that require both vision and language processing, models with bi-directional attention or models employing fusion techniques are often employed to capture the context of multiple modalities at once. As GPT does not natively process vision tokens, to exploit the advancements in GPT models for VQA in robotic surgery, we design an end-to-end trainable Language-Vision GPT (LV-GPT) model that expands the GPT2 model to include vision input (image). The proposed LV-GPT incorporates a feature extractor (vision tokenizer) and vision token embedding (token type and pose). Given the limitations of unidirectional attention in GPT models and their ability to generate coherent long paragraphs, we carefully sequence the word tokens before the vision tokens, mimicking the human thought process of understanding the question before inferring an answer from an image. Quantitatively, we show that the LV-GPT model outperforms other state-of-the-art VQA models on two publicly available surgical-VQA datasets (based on the endoscopic vision challenge robotic scene segmentation 2018 and CholecTriplet2021) and on our newly annotated dataset (based on the holistic surgical scene dataset). We further annotate all three datasets with question-type annotations to allow sub-type analysis. Furthermore, we extensively study and present the effects of token sequencing and of token type and pose embedding for vision tokens in the LV-GPT model. Comment: The manuscript has been accepted at MICCAI 2023. Code is available at: https://github.com/lalithjets/SurgicalGP
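
    A minimal sketch of the idea described in the abstract: vision features are projected into GPT2's embedding space, given token-type and pose embeddings, and placed after the word tokens before the combined sequence is passed through GPT2. Module names, feature dimensions, and the classification-from-last-token readout are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
from transformers import GPT2Model


class LVGPTSketch(nn.Module):
    def __init__(self, num_answers: int, num_vision_tokens: int = 49, vision_dim: int = 2048):
        super().__init__()
        self.gpt2 = GPT2Model.from_pretrained("gpt2")
        hidden = self.gpt2.config.n_embd
        # Vision "tokenizer" head: project grid/patch features into GPT2's embedding space.
        self.vision_proj = nn.Linear(vision_dim, hidden)
        # Token-type embedding distinguishes word tokens (0) from vision tokens (1).
        self.token_type = nn.Embedding(2, hidden)
        # Learned pose (position) embedding for the vision tokens.
        self.vision_pose = nn.Embedding(num_vision_tokens, hidden)
        self.classifier = nn.Linear(hidden, num_answers)

    def forward(self, question_ids: torch.Tensor, vision_feats: torch.Tensor) -> torch.Tensor:
        b, n_v, _ = vision_feats.shape
        # Embed the question words with GPT2's own word-token embedding.
        word_emb = self.gpt2.wte(question_ids) + self.token_type(torch.zeros_like(question_ids))
        # Embed the vision tokens: projection + token type + pose.
        pose_ids = torch.arange(n_v, device=vision_feats.device).expand(b, n_v)
        vis_emb = (self.vision_proj(vision_feats)
                   + self.token_type(torch.ones_like(pose_ids))
                   + self.vision_pose(pose_ids))
        # Word tokens are sequenced BEFORE vision tokens, as the abstract describes.
        inputs = torch.cat([word_emb, vis_emb], dim=1)
        out = self.gpt2(inputs_embeds=inputs).last_hidden_state
        return self.classifier(out[:, -1])  # answer logits read from the final position
```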

    Improving Visual Question Answering by Referring to Generated Paragraph Captions

    Paragraph-style image captions describe diverse aspects of an image, as opposed to the more common single-sentence captions that only provide an abstract description of the image. These paragraph captions can hence contain substantial information about the image for tasks such as visual question answering. Moreover, this textual information is complementary to the visual information present in the image, because it can discuss both more abstract concepts and more explicit, intermediate symbolic information about objects, events, and scenes that can be directly matched with the textual question and copied into the textual answer (i.e., via an easier modality match). Hence, we propose a combined Visual and Textual Question Answering (VTQA) model that takes as input a paragraph caption as well as the corresponding image, and answers the given question based on both inputs. In our model, the inputs are fused to extract related information by cross-attention (early fusion), then fused again in the form of consensus (late fusion), and finally expected answers are given an extra score to enhance their chance of selection (later fusion). Empirical results show that paragraph captions, even when automatically generated (via an RL-based encoder-decoder model), help answer more visual questions correctly. Overall, our joint model, when trained on the Visual Genome dataset, significantly improves VQA performance over a strong baseline model. Comment: ACL 2019 (7 pages)
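
    A rough sketch of the three fusion stages named in the abstract (early cross-attention fusion, late consensus fusion, and the extra "later fusion" answer re-scoring), assuming pre-extracted image-region, caption, and question embeddings. Dimensions, head counts, and the simple additive consensus are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn


class VTQAFusionSketch(nn.Module):
    def __init__(self, dim: int = 512, num_answers: int = 3000):
        super().__init__()
        # Early fusion: the question attends to image regions and to caption sentences.
        self.img_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.cap_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.img_head = nn.Linear(dim, num_answers)
        self.cap_head = nn.Linear(dim, num_answers)

    def forward(self, question, image_regions, caption_sents, answer_prior=None):
        # question: (b, 1, dim) pooled question vector; the others are token/region sequences.
        q_img, _ = self.img_attn(question, image_regions, image_regions)  # early fusion (visual)
        q_cap, _ = self.cap_attn(question, caption_sents, caption_sents)  # early fusion (textual)
        vis_logits = self.img_head(q_img.squeeze(1))
        txt_logits = self.cap_head(q_cap.squeeze(1))
        # Late fusion: combine the two modality-specific predictions as a consensus.
        logits = vis_logits + txt_logits
        # "Later" fusion: boost candidate answers that appear in the generated caption
        # (answer_prior is assumed to be a 0/1 mask over the answer vocabulary).
        if answer_prior is not None:
            logits = logits + answer_prior
        return logits
```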

    Novel approach to integrate various feature extraction techniques for the Visual Question Answering System with skeletal images in the healthcare sector

    The Medical Imaging Query Response System is one of the most challenging problems in medical science. Understanding and classifying the diverse imaging representations of the human body requires considerable effort and expertise, and systems deployed in the healthcare sector must be rigorously validated by their users. A wide range of imaging techniques, including MRI, CT, ultrasound, X-ray, PET-CT, and others, plays a pivotal role in identifying medical issues and supports both patient engagement and clinical decision-making. However, the models, techniques, and datasets used to process textual and visual information introduce complexities that can impede the delivery of pertinent clinical solutions. The objective of the proposed approach is to conduct a comprehensive comparative analysis of feature extraction methodologies for both the visual and textual components of a Visual Question Answering (VQA) system focused on human skeletal images, with the aim of improving performance on newer datasets and addressing the limitations of existing models. The study also seeks to help researchers identify and optimize methods that improve VQA accuracy and to select a suitable methodology for different datasets. The models under scrutiny cover a range of feature extraction methods intended to improve model quality for healthcare applications. To gauge the efficacy of each model, an array of metrics is employed, including classification accuracy, F-classification, C-true positive rate (CTPR), C-precision, C-recall, C-sensitivity, and C-false negative rate (FNR). These metrics are intended to quantify performance across datasets and to optimize both the visual and textual components so that posed queries receive accurate responses.
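
    A small sketch of the kind of per-class evaluation listed in the abstract (accuracy, precision, recall / true-positive rate, sensitivity, and false-negative rate). The abstract does not define its "C-" prefixed metrics, so standard per-class definitions from a confusion matrix are assumed here; the example labels are hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score


def per_class_metrics(y_true, y_pred, labels):
    # Confusion matrix with rows = true classes, columns = predicted classes.
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    precision = tp / np.maximum(tp + fp, 1)  # per-class precision
    recall = tp / np.maximum(tp + fn, 1)     # per-class recall / TPR / sensitivity
    fnr = fn / np.maximum(tp + fn, 1)        # per-class false-negative rate
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": dict(zip(labels, precision)),
        "recall": dict(zip(labels, recall)),
        "fnr": dict(zip(labels, fnr)),
    }


# Example: yes/no answers to questions about skeletal images (hypothetical data).
print(per_class_metrics(["yes", "no", "yes", "no"], ["yes", "no", "no", "no"], ["yes", "no"]))
```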