TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation
Text-VQA aims at answering questions that require understanding the textual
cues in an image. Despite the great progress of existing Text-VQA methods,
their performance suffers from insufficient human-labeled question-answer (QA)
pairs. We observe, however, that the scene text in existing datasets is not fully exploited: only a small portion of the text in each image is involved in the annotated QA pairs, which leaves a large amount of useful information unused. To address this deficiency, we develop a new method to
generate high-quality and diverse QA pairs by explicitly utilizing the existing
rich text available in the scene context of each image. Specifically, we
propose TAG, a text-aware visual question-answer generation architecture that learns to produce meaningful and accurate QA samples using a multimodal
transformer. The architecture exploits underexplored scene text information and
enhances scene understanding of Text-VQA models by combining the generated QA
pairs with the initial training data. Extensive experimental results on two
well-known Text-VQA benchmarks (TextVQA and ST-VQA) demonstrate that our
proposed TAG effectively enlarges the training data, which improves Text-VQA performance without extra labeling effort. Moreover, our model
outperforms state-of-the-art approaches that are pre-trained with extra
large-scale data. Code is available at https://github.com/HenryJunW/TAG. Comment: BMVC 2022
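
As a rough illustration of the augmentation step described in the abstract, the following PyTorch sketch merges generated QA pairs with the human-annotated ones before training; QADataset, generate_qa_pairs, and the sample fields are hypothetical stand-ins, not TAG's released code.

# Hypothetical sketch: enlarge a Text-VQA training set by concatenating
# generated QA pairs with the human-annotated ones.
from torch.utils.data import ConcatDataset, DataLoader, Dataset

class QADataset(Dataset):
    """Wraps a list of (image_id, question, answer) triples."""
    def __init__(self, samples):
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]

def generate_qa_pairs(image_id, ocr_tokens):
    # Stand-in for the text-aware generator; in TAG this role is played by a
    # multimodal transformer conditioned on the image and its OCR tokens.
    return [(image_id, "What does the text in the image say?", tok)
            for tok in ocr_tokens]

annotated = QADataset([("img_001", "What brand is on the bottle?", "coca cola")])
generated = QADataset(generate_qa_pairs("img_001", ["exit", "open"]))

# The augmented training set is simply the union of both sources.
train_set = ConcatDataset([annotated, generated])
loader = DataLoader(train_set, batch_size=2, shuffle=True)
image_ids, questions, answers = next(iter(loader))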
CLIP-Lite: Information Efficient Visual Representation Learning from Textual Annotations
We propose CLIP-Lite, an information-efficient method for visual
representation learning by feature alignment with textual annotations. Compared
to the previously proposed CLIP model, CLIP-Lite requires only one negative
image-text sample pair for every positive image-text sample during the
optimization of its contrastive learning objective. We accomplish this by
taking advantage of an information-efficient lower bound to maximize the mutual
information between the two input modalities. This allows CLIP-Lite to be
trained with significantly reduced amounts of data and batch sizes while
obtaining better performance than CLIP. We evaluate CLIP-Lite by pretraining on
the COCO-Captions dataset and testing transfer learning to other datasets.
CLIP-Lite obtains a +15.4% mAP absolute gain in performance on Pascal VOC
classification, and a +22.1% top-1 accuracy gain on ImageNet, while being
comparable or superior to other, more complex, text-supervised models.
CLIP-Lite is also superior to CLIP on image and text retrieval, zero-shot
classification, and visual grounding. Finally, by performing explicit
image-text alignment during representation learning, we show that CLIP-Lite can
leverage language semantics to encourage bias-free visual representations that
can be used in downstream tasks.
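
To make the one-negative objective more concrete, here is a minimal PyTorch sketch of a Jensen-Shannon-style mutual-information lower bound optimized with a single negative text per image; the bilinear critic and all names are illustrative assumptions, not CLIP-Lite's actual implementation.

import torch
import torch.nn.functional as F

def one_negative_mi_loss(img_emb, txt_pos, txt_neg, critic):
    """JSD-style lower bound on mutual information with one positive and one
    negative text embedding per image; minimizing the loss tightens the bound."""
    pos_score = critic(img_emb, txt_pos)  # should be high for matched pairs
    neg_score = critic(img_emb, txt_neg)  # should be low for mismatched pairs
    # E_p[-softplus(-T)] - E_n[softplus(T)] is a lower bound on I(image; text).
    mi_lower_bound = (-F.softplus(-pos_score)).mean() - F.softplus(neg_score).mean()
    return -mi_lower_bound

class BilinearCritic(torch.nn.Module):
    """One simple choice of critic that scores an (image, text) embedding pair."""
    def __init__(self, dim):
        super().__init__()
        self.W = torch.nn.Parameter(0.01 * torch.randn(dim, dim))

    def forward(self, img, txt):
        return (img @ self.W * txt).sum(dim=-1)

critic = BilinearCritic(dim=256)
img = torch.randn(8, 256)       # image embeddings from the vision encoder
txt_pos = torch.randn(8, 256)   # embeddings of the matching captions
txt_neg = torch.randn(8, 256)   # one mismatched caption per image
loss = one_negative_mi_loss(img, txt_pos, txt_neg, critic)
loss.backward()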