9 research outputs found

TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation

Author: Davis Larry S.
Gao Mingfei
Hu Yuqian
JaJa Joseph F.
Ramaiah Chetan
Selvaraju Ramprasaath R.
Wang Jun
Xu Ran
Publication date: 07/10/2022

Text-VQA aims at answering questions that require understanding the textual cues in an image. Despite the great progress of existing Text-VQA methods, their performance suffers from insufficient human-labeled question-answer (QA) pairs. However, we observe that, in general, the scene text is not fully exploited in the existing datasets: only a small portion of the text in each image participates in the annotated QA activities. This results in a huge waste of useful information. To address this deficiency, we develop a new method to generate high-quality and diverse QA pairs by explicitly utilizing the existing rich text available in the scene context of each image. Specifically, we propose TAG, a text-aware visual question-answer generation architecture that learns to produce meaningful and accurate QA samples using a multimodal transformer. The architecture exploits underexplored scene text information and enhances scene understanding of Text-VQA models by combining the generated QA pairs with the initial training data. Extensive experimental results on two well-known Text-VQA benchmarks (TextVQA and ST-VQA) demonstrate that our proposed TAG effectively enlarges the training data, which helps improve Text-VQA performance without extra labeling effort. Moreover, our model outperforms state-of-the-art approaches that are pre-trained with extra large-scale data. Code is available at https://github.com/HenryJunW/TAG. Comment: BMVC 202

arXiv.org e-Print Archive

CLIP-Lite: Information Efficient Visual Representation Learning from Textual Annotations

Author: Naik Nikhil
Ordonez Vicente
Selvaraju Ramprasaath R.
Shrivastava Aman
Publication date: 13/12/2021

We propose CLIP-Lite, an information efficient method for visual representation learning by feature alignment with textual annotations. Compared to the previously proposed CLIP model, CLIP-Lite requires only one negative image-text sample pair for every positive image-text sample during the optimization of its contrastive learning objective. We accomplish this by taking advantage of an information efficient lower-bound to maximize the mutual information between the two input modalities. This allows CLIP-Lite to be trained with significantly reduced amounts of data and batch sizes while obtaining better performance than CLIP. We evaluate CLIP-Lite by pretraining on the COCO-Captions dataset and testing transfer learning to other datasets. CLIP-Lite obtains a +15.4% mAP absolute gain in performance on Pascal VOC classification, and a +22.1% top-1 accuracy gain on ImageNet, while being comparable or superior to other, more complex, text-supervised models. CLIP-Lite is also superior to CLIP on image and text retrieval, zero-shot classification, and visual grounding. Finally, by performing explicit image-text alignment during representation learning, we show that CLIP-Lite can leverage language semantics to encourage bias-free visual representations that can be used in downstream tasks.
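
The one-negative-per-positive objective described in the abstract above can be illustrated with a minimal sketch. The abstract does not name the bound or the scoring function, so this sketch assumes a Jensen-Shannon-style lower bound on mutual information and a plain dot-product critic; the function names `softplus` and `jsd_mi_lower_bound` are hypothetical and are not taken from the CLIP-Lite code.

```python
import numpy as np


def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return np.logaddexp(0.0, z)


def jsd_mi_lower_bound(img, txt):
    """Jensen-Shannon-style bound using one negative caption per image.

    img, txt: arrays of shape (batch, dim); row i of txt is the caption
    matched to row i of img. ASSUMPTION: the critic is a dot product.
    """
    # Scores of the matched (positive) image-text pairs.
    pos = np.sum(img * txt, axis=1)
    # One mismatched caption per image: shift the captions by one row.
    neg_txt = np.roll(txt, shift=1, axis=0)
    neg = np.sum(img * neg_txt, axis=1)
    # The bound rewards high positive scores and low negative scores.
    return np.mean(-softplus(-pos)) - np.mean(softplus(neg))


# Toy embeddings: well-aligned pairs versus deliberately shifted pairs.
img = 5.0 * np.eye(4)
aligned = jsd_mi_lower_bound(img, 5.0 * np.eye(4))
shifted = jsd_mi_lower_bound(img, np.roll(5.0 * np.eye(4), 1, axis=0))
```

Training would maximize this quantity (minimize its negative) with respect to the encoders that produce `img` and `txt`. The batch needs at least two rows, since with a single row the shifted caption is the matched caption itself.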

arXiv.org e-Print Archive

Reframing Explanation as an Interactive Medium: The EQUAS (Explainable QUestion Answering System) Project

Author: Batra Dhruv
Bau David
Diller David
Fasching Josh
Ferguson William
Fiotto-Kaufman Jaden
Goyal Yash
Lee Stefan
Miller Jeff
Moffitt Kerry
Montes de Oca Alex
Mooney Raymond
Parikh Devi
Selvaraju Ramprasaath R.
Shrivastava Ayush
Torralba Antonio
Wu Jialin
Publication venue: Wiley
Publication date: 22/07/2022

DSpace@MIT