Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relations
We propose a novel data augmentation method for labeled sentences called contextual
augmentation. We assume an invariance: sentences remain natural even if their
words are replaced with other words that hold paradigmatic relations to them. We
stochastically replace words with other words predicted by a bi-directional
language model at the word positions. The words predicted from context are
numerous yet remain appropriate substitutes for the original words. Furthermore,
we retrofit the language model with a label-conditional architecture, which
allows the model to augment sentences without breaking label-compatibility.
Through experiments on six different text classification tasks, we demonstrate
that the proposed method improves classifiers based on convolutional or recurrent neural
networks.
Comment: NAACL 2018
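As an illustration of the core idea, here is a minimal sketch of paradigmatic word replacement driven by a masked language model. It uses the Hugging Face transformers library with bert-base-uncased as a stand-in for the paper's own bi-directional language model, and it omits the label-conditional architecture; all names and parameters here are illustrative assumptions, not the authors' implementation.

import random

from transformers import pipeline

# Pre-trained masked LM standing in for the paper's bi-directional LM.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def contextual_augment(tokens, replace_prob=0.15, top_k=5):
    """Stochastically replace words with context-predicted alternatives."""
    augmented = list(tokens)
    for i in range(len(tokens)):
        if random.random() > replace_prob:
            continue
        # Mask the i-th word and let the model predict it from both contexts.
        masked = tokens[:i] + [fill_mask.tokenizer.mask_token] + tokens[i + 1:]
        predictions = fill_mask(" ".join(masked), top_k=top_k)
        # Sample one of the top predictions as the paradigmatic replacement.
        augmented[i] = random.choice(predictions)["token_str"]
    return augmented

print(contextual_augment("the actors are fantastic".split()))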
A negative case analysis of visual grounding methods for VQA
Existing Visual Question Answering (VQA) methods tend to exploit dataset
biases and spurious statistical correlations, instead of producing right
answers for the right reasons. To address this issue, recent bias mitigation
methods for VQA propose to incorporate visual cues (e.g., human attention maps)
to better ground the VQA models, showcasing impressive gains. However, we show
that the performance improvements are not a result of improved visual
grounding, but a regularization effect which prevents over-fitting to
linguistic priors. For instance, we find that it is not actually necessary to
provide proper, human-based cues; random, insensible cues also result in
similar improvements. Based on this observation, we propose a simpler
regularization scheme that does not require any external annotations and yet
achieves near state-of-the-art performance on VQA-CPv2.
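The random-cue finding lends itself to a compact illustration. The PyTorch sketch below shows how a random, insensible cue can act as an attention regularizer: an auxiliary loss aligns the model's attention with a randomly drawn importance map instead of a human attention map. This is a hypothetical reading of the abstract's observation, not the paper's actual regularization scheme, and the attention interface is assumed.

import torch
import torch.nn.functional as F

def random_cue_loss(attention, weight=1.0):
    """Auxiliary regularizer aligning attention with a random importance map.

    attention: (batch, num_regions) softmax weights over image regions
    (an assumed interface; real VQA models differ).
    """
    random_map = torch.rand_like(attention)
    random_map = random_map / random_map.sum(dim=-1, keepdim=True)
    # KL divergence pulls the model's attention toward the random cue,
    # discouraging over-fitting to fixed linguistic shortcuts.
    return weight * F.kl_div(attention.clamp_min(1e-8).log(), random_map,
                             reduction="batchmean")

# Illustrative use inside a training step:
# loss = answer_loss + random_cue_loss(model_attention)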
TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation
Text-VQA aims at answering questions that require understanding the textual
cues in an image. Despite the great progress of existing Text-VQA methods,
their performance suffers from insufficient human-labeled question-answer (QA)
pairs. However, we observe that, in general, the scene text is not fully
exploited in existing datasets -- only a small portion of the text in each
image is involved in the annotated QA pairs, leaving much useful information
untapped. To address this deficiency, we develop a new method to
generate high-quality and diverse QA pairs by explicitly utilizing the existing
rich text available in the scene context of each image. Specifically, we
propose TAG, a text-aware visual question-answer generation architecture that
learns to produce meaningful and accurate QA samples using a multimodal
transformer. The architecture exploits underexplored scene text information and
enhances scene understanding of Text-VQA models by combining the generated QA
pairs with the initial training data. Extensive experimental results on two
well-known Text-VQA benchmarks (TextVQA and ST-VQA) demonstrate that our
proposed TAG effectively enlarges the training data, which helps improve
Text-VQA performance without extra labeling effort. Moreover, our model
outperforms state-of-the-art approaches that are pre-trained with extra
large-scale data. Code is available at https://github.com/HenryJunW/TAG.
Comment: BMVC 2022
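To make the data flow concrete, here is an illustrative sketch of the augmentation pipeline the abstract describes: generate QA pairs from under-used scene text, then merge them with the original training set. TAG itself produces the pairs with a trained multimodal transformer (see the linked repository); the generate_qa stub below is a hypothetical stand-in that only demonstrates the plumbing.

from dataclasses import dataclass

@dataclass
class QAPair:
    question: str
    answer: str

def generate_qa(ocr_tokens):
    """Hypothetical stand-in for TAG's text-aware QA generator.

    The real model conditions a multimodal transformer on the image and
    its scene text; here we emit template questions to show the flow.
    """
    return [QAPair(question="Which word appears in the image?", answer=tok)
            for tok in ocr_tokens]

def augment_training_set(original_pairs, images_ocr):
    """Combine generated QA pairs with the initial human-labeled data."""
    generated = [qa for tokens in images_ocr for qa in generate_qa(tokens)]
    return original_pairs + generated

train = augment_training_set(
    original_pairs=[QAPair("What is the bus number?", "42")],
    images_ocr=[["EXIT", "42"]],
)
print(len(train))  # 3: one human-labeled pair plus two generated pairs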