Surgical-VQLA: Transformer with Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery
Despite the availability of computer-aided simulators and recorded videos of surgical procedures, junior residents still rely heavily on experts to answer their queries. However, expert surgeons are often overloaded with clinical and academic work and have limited time to answer. To address this, we develop a surgical question-answering system to facilitate robot-assisted surgical scene and activity understanding from recorded videos. Most existing visual question answering (VQA) methods require an object detector and a region-based feature extractor to extract visual features and fuse them with the embedded text of the question for answer generation. However, (i) surgical object detection models are scarce due to small datasets and a lack of bounding box annotations; (ii) current strategies for fusing heterogeneous modalities such as text and image are naive; (iii) localized answering is missing, although it is crucial in complex surgical scenarios. In this paper, we propose Visual Question Localized-Answering in Robotic Surgery (Surgical-VQLA) to localize the specific surgical area during answer prediction. To handle the fusion of the heterogeneous modalities, we design a gated vision-language embedding (GVLE) to build input patches for the Language Vision Transformer (LViT), which predicts the answer. To obtain localization, we add a detection head in parallel with the prediction head of the LViT. We also integrate a generalized intersection over union (GIoU) loss to boost localization performance while preserving the accuracy of the question-answering model. We annotate two VQLA datasets using publicly available surgical videos from the EndoVis-17 and EndoVis-18 MICCAI challenges. Our validation results suggest that Surgical-VQLA can better understand the surgical scene and localize the specific area related to the question-answering. GVLE presents an efficient language-vision embedding technique, showing superior performance over the existing benchmarks.
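The GIoU loss mentioned in the abstract above can be illustrated with a minimal sketch. It follows the standard GIoU definition (IoU minus the fraction of the smallest enclosing box not covered by the union); the (x1, y1, x2, y2) box format and the function names giou and giou_loss are assumptions made for illustration, not details taken from the Surgical-VQLA implementation.

```python
def giou(box_a, box_b):
    """Generalized IoU of two axis-aligned boxes given as (x1, y1, x2, y2).

    Assumes x1 < x2 and y1 < y2 for both boxes (illustrative convention).
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)

    # Intersection rectangle; width/height clamp to zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    union = area_a + area_b - inter
    iou = inter / union

    # Smallest axis-aligned box enclosing both inputs.
    enclose_w = max(ax2, bx2) - min(ax1, bx1)
    enclose_h = max(ay2, by2) - min(ay1, by1)
    enclose = enclose_w * enclose_h

    # Penalize the empty part of the enclosing box, so disjoint boxes
    # still receive a distance-dependent signal.
    return iou - (enclose - union) / enclose


def giou_loss(pred_box, target_box):
    """Loss in [0, 2]: 0 for a perfect match, larger for worse localization."""
    return 1.0 - giou(pred_box, target_box)
```

Unlike plain IoU, which is zero for every pair of non-overlapping boxes, GIoU keeps decreasing as the boxes move apart, which is why it is commonly used as a localization loss.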
Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes
Training models to apply common-sense linguistic knowledge and visual
concepts from 2D images to 3D scene understanding is a promising direction that
researchers have only recently started to explore. However, it still remains
understudied whether 2D distilled knowledge can provide useful representations
for downstream 3D vision-language tasks such as 3D question answering. In this
paper, we propose a novel 3D pre-training Vision-Language method, namely
Multi-CLIP, that enables a model to learn language-grounded and transferable 3D
scene point cloud representations. We leverage the representational power of
the CLIP model by maximizing the agreement between the encoded 3D scene
features and the corresponding 2D multi-view image and text embeddings in the
CLIP space via a contrastive objective. To validate our approach, we consider
the challenging downstream tasks of 3D Visual Question Answering (3D-VQA) and
3D Situated Question Answering (3D-SQA). To this end, we develop novel
multi-modal transformer-based architectures and we demonstrate how our
pre-training method can benefit their performance. Quantitative and qualitative
experimental results show that Multi-CLIP outperforms state-of-the-art works
across the downstream tasks of 3D-VQA and 3D-SQA and leads to a well-structured
3D scene feature space. Comment: The first two authors contributed equally
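The contrastive objective described in the abstract above can be sketched as a CLIP-style symmetric InfoNCE loss between a batch of 3D scene features and their paired embeddings. This is a minimal illustration under stated assumptions: the exact loss form, the temperature value of 0.07, and the function names are hypothetical and are not taken from the Multi-CLIP paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(anchors, targets, temperature=0.07):
    """Mean cross-entropy of matching anchors[i] to targets[i] within the batch."""
    a = [l2_normalize(v) for v in anchors]
    t = [l2_normalize(v) for v in targets]
    total = 0.0
    for i in range(len(a)):
        # Similarity of anchor i to every target, scaled by the temperature.
        logits = [sum(x * y for x, y in zip(a[i], t_j)) / temperature for t_j in t]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)


def symmetric_contrastive_loss(scene_feats, paired_embeds, temperature=0.07):
    """Average of the scene-to-embedding and embedding-to-scene losses."""
    return 0.5 * (
        info_nce(scene_feats, paired_embeds, temperature)
        + info_nce(paired_embeds, scene_feats, temperature)
    )
```

Minimizing this loss pulls each scene feature toward its own paired embedding and pushes it away from the other embeddings in the batch; in the setting above, the paired embeddings would be the multi-view image or text embeddings in CLIP space.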
Reading Between the Lanes: Text VideoQA on the Road
Text and signs around roads provide crucial information for drivers, vital
for safe navigation and situational awareness. Scene text recognition in motion
is a challenging problem, while textual cues typically appear for a short time
span, and early detection at a distance is necessary. Systems that exploit such
information to assist the driver should not only extract and incorporate visual
and textual cues from the video stream but also reason over time. To address
this issue, we introduce RoadTextVQA, a new dataset for the task of video
question answering (VideoQA) in the context of driver assistance. RoadTextVQA
consists of driving videos collected from multiple countries, annotated
with questions, all based on text or road signs present in the driving
videos. We assess the performance of state-of-the-art video question answering
models on our RoadTextVQA dataset, highlighting the significant potential for
improvement in this domain and the usefulness of the dataset in advancing
research on in-vehicle support systems and text-aware multimodal question
answering. The dataset is available at
http://cvit.iiit.ac.in/research/projects/cvit-projects/roadtextvq
PreSTU: Pre-Training for Scene-Text Understanding
The ability to recognize and reason about text embedded in visual inputs is
often lacking in vision-and-language (V&L) models, perhaps because V&L
pre-training methods have often failed to include such an ability in their
training objective. In this paper, we propose PreSTU, a novel pre-training
recipe dedicated to scene-text understanding (STU). PreSTU introduces OCR-aware
pre-training objectives that encourage the model to recognize text from an
image and connect it to the rest of the image content. We implement PreSTU
using a simple transformer-based encoder-decoder architecture, combined with
large-scale image-text datasets with scene text obtained from an off-the-shelf
OCR system. We empirically demonstrate the effectiveness of this pre-training
approach on eight visual question answering and four image captioning
benchmarks. Comment: Accepted to ICCV 202
The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision
We propose the Neuro-Symbolic Concept Learner (NS-CL), a model that learns
visual concepts, words, and semantic parsing of sentences without explicit
supervision on any of them; instead, our model learns by simply looking at
images and reading paired questions and answers. Our model builds an
object-based scene representation and translates sentences into executable,
symbolic programs. To bridge the learning of two modules, we use a
neuro-symbolic reasoning module that executes these programs on the latent
scene representation. Analogous to human concept learning, the perception
module learns visual concepts based on the language description of the object
being referred to. Meanwhile, the learned visual concepts facilitate learning
new words and parsing new sentences. We use curriculum learning to guide the
search over the large compositional space of images and language. Extensive
experiments demonstrate the accuracy and efficiency of our model on learning
visual concepts, word representations, and semantic parsing of sentences.
Further, our method allows easy generalization to new object attributes,
compositions, language concepts, scenes and questions, and even new program
domains. It also empowers applications including visual question answering and
bidirectional image-text retrieval. Comment: ICLR 2019 (Oral). Project page: http://nscl.csail.mit.edu